The dataset used in this project consists of 651 randomly sampled movies produced and released before 2016. The data were collected from publicly available sources, primarily the Rotten Tomatoes and IMDb APIs. These APIs provide comprehensive information about each movie, including variables related to cast, genre, critic and audience scores, and other attributes.
Because the data were collected through a random sampling method, we can reasonably generalize our findings to the broader population of movies released prior to 2016. However, since this is an observational dataset, we should be cautious in making any causal claims based on our analysis.
Some variables, such as actor1 to actor5,
are informative but not directly suitable for statistical modeling
without significant transformation. As part of the data preparation
process, we will make decisions about which variables are meaningful for
modeling, and potentially restructure or exclude certain variables to
best answer our research question.
Additionally, attention will be given to issues such as multicollinearity and missing data, as these can affect the validity of our model and interpretations.
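As a rough illustration of how multicollinearity could be screened, the sketch below computes variance inflation factors by hand in base R. It runs on simulated toy data rather than the movies data, and the helper name `vif_manual` and the columns `x1`, `x2`, `x3` are hypothetical.

```r
# Toy data standing in for numeric predictors (simulated, not the movies data)
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(n)                  # independent of the others
toy <- data.frame(x1, x2, x3)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

Values near 1 suggest little overlap with the other predictors, while large values flag predictors that carry nearly the same information as the rest.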
What factors significantly influence the audience scores of movies?
## tibble [651 × 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
## title title_type genre runtime
## Length:651 Documentary : 55 Drama :305 Min. : 39.0
## Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:115.8
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## mpaa_rating studio thtr_rel_year
## G : 19 Paramount Pictures : 37 Min. :1970
## NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
## PG :118 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :133 Universal Pictures : 23 Mean :1998
## R :329 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 50 (Other) :507 Max. :2014
## NA's : 8
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## Min. : 1.00 Min. : 1.00 Min. :1991 Min. : 1.000
## 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000
## Median : 7.00 Median :15.00 Median :2004 Median : 6.000
## Mean : 6.74 Mean :14.42 Mean :2004 Mean : 6.333
## 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008 3rd Qu.: 9.000
## Max. :12.00 Max. :31.00 Max. :2015 Max. :12.000
## NA's :8 NA's :8
## dvd_rel_day imdb_rating imdb_num_votes critics_rating
## Min. : 1.00 Min. :1.900 Min. : 180 Certified Fresh:135
## 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546 Fresh :209
## Median :15.00 Median :6.600 Median : 15116 Rotten :307
## Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :31.00 Max. :9.000 Max. :893008
## NA's :8
## critics_score audience_rating audience_score best_pic_nom best_pic_win
## Min. : 1.00 Spilled:275 Min. :11.00 no :629 no :644
## 1st Qu.: 33.00 Upright:376 1st Qu.:46.00 yes: 22 yes: 7
## Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box director
## no :558 no :579 no :608 no :636 Length:651
## yes: 93 yes: 72 yes: 43 yes: 15 Class :character
## Mode :character
##
##
##
##
## actor1 actor2 actor3 actor4
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## actor5 imdb_url rt_url
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## title title_type genre runtime
## 0 0 0 1
## mpaa_rating studio thtr_rel_year thtr_rel_month
## 0 8 0 0
## thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## 0 8 8 8
## imdb_rating imdb_num_votes critics_rating critics_score
## 0 0 0 0
## audience_rating audience_score best_pic_nom best_pic_win
## 0 0 0 0
## best_actor_win best_actress_win best_dir_win top200_box
## 0 0 0 0
## director actor1 actor2 actor3
## 2 2 7 9
## actor4 actor5 imdb_url rt_url
## 13 15 0 0
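The counts above show gaps in runtime and the three dvd_rel_* columns. One possible way to fill numeric gaps like these is median imputation; the sketch below wraps it in a small helper. The name `impute_median` and the toy data frame are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: replace NAs in every numeric column with that column's median
impute_median <- function(df) {
  num_cols <- names(df)[sapply(df, is.numeric)]
  for (col in num_cols) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
  }
  df
}

# Toy data frame (made-up values) with gaps in two numeric columns
toy <- data.frame(
  runtime = c(90, NA, 120, 100),
  dvd_rel_day = c(5, 15, NA, 25),
  title = c("A", "B", "C", "D")
)

toy_filled <- impute_median(toy)
colSums(is.na(toy_filled))
```

Character columns such as director or the actor fields are left untouched by this helper, since a median is not defined for them.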
# Impute missing runtime with the median
movies$runtime[is.na(movies$runtime)] <- median(movies$runtime, na.rm = TRUE)
# Impute missing dvd_rel_year with the median
movies$dvd_rel_year[is.na(movies$dvd_rel_year)] <- median(movies$dvd_rel_year, na.rm = TRUE)
# Impute missing dvd_rel_month with the median
movies$dvd_rel_month[is.na(movies$dvd_rel_month)] <- median(movies$dvd_rel_month, na.rm = TRUE)
# Impute missing dvd_rel_day with the median
movies$dvd_rel_day[is.na(movies$dvd_rel_day)] <- median(movies$dvd_rel_day, na.rm = TRUE)
# Plot the distribution of the audience_score variable using a histogram
ggplot(movies, aes(x = audience_score)) +
geom_histogram(binwidth = 10, fill = "palegreen3", color = "black") + # histogram with palegreen fill and black border
labs(title = "Distribution of Audience Score",
x = "Audience Score",
y = "Count") +
theme_minimal()
Interpretation: This chart shows the distribution of audience scores for the movies. Most films received relatively high scores, with the most common scores falling around 75 to 87.5. The distribution is slightly left-skewed: a tail of low-scoring films pulls the mean (62.4) below the median (65), so more than half of the films score above the mean. Overall, audience scores tend to be positive.
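The direction of skew can also be checked numerically instead of by eye. The sketch below uses made-up scores (not the movies data) and a hypothetical `skewness` helper based on the third central moment; a negative value together with a mean below the median points to left skew.

```r
# Made-up scores with a tail of low values (not the movies data)
scores <- c(20, 35, 50, 60, 65, 70, 75, 78, 80, 82, 85, 88, 90)

# Moment-based skewness (population form, dividing by n):
# mean cubed deviation divided by the 3/2 power of the mean squared deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

c(mean = mean(scores), median = median(scores), skewness = skewness(scores))
```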
# Calculate correlations between audience_score and numeric variables
cor(movies$audience_score, movies$imdb_rating, use = "complete.obs")
## [1] 0.8648652
cor(movies$audience_score, movies$critics_score, use = "complete.obs")
## [1] 0.7042762
ggpairs(movies[, c("audience_score", "imdb_rating", "critics_score", "runtime", "imdb_num_votes")],
  upper = list(continuous = wrap("cor", size = 4)), # correlations in the upper panels
  lower = list(continuous = wrap("points", size=1, alpha=0.5)), # scatter plots in the lower panels
  diag = list(continuous = wrap("densityDiag", alpha=0.5)), # density plots on the diagonal
  title = "Pair Plot of Movie Scores and Attributes"
) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
Interpretation: Pair Plot of Movie Scores and Attributes
audience_score: Most films have high audience scores (similar to the previous plot).
imdb_rating: IMDb ratings generally range between 6 and 8.
critics_score: Critics' scores are quite spread out but tend to be high for many films.
runtime: Most films have standard durations (~90–120 minutes), with some very long outliers.
imdb_num_votes: Most films receive relatively few votes, though some popular ones have many.
ggplot(movies, aes(x = imdb_rating, y = audience_score, color = mpaa_rating)) +
geom_point(alpha = 0.6) +
facet_wrap(~ genre) +
theme_minimal() +
labs(
title = "Audience Score vs IMDb Rating by Genre and MPAA Rating",
x = "IMDb Rating",
y = "Audience Score",
color = "MPAA Rating"
)
Interpretation: Audience Score vs IMDb Rating by Genre
Overall Trend:
There’s a clear positive relationship — films with higher IMDb ratings
tend to also receive higher audience scores across most genres.
Strong Correlation:
Genres like Action & Adventure, Documentary,
Mystery & Suspense, and Sci-Fi & Fantasy show
tightly clustered points, indicating strong alignment between audience
and IMDb ratings.
More Variation:
Genres such as Comedy, Drama, and Horror
display a similar trend but with more spread, suggesting varied audience
reactions despite similar IMDb scores.
Notable Patterns:
Animation films often score high on both metrics, while Art
House & International films show a positive trend with more
variability.
Conclusion:
IMDb ratings and audience scores generally align across genres, though
the strength of that relationship varies.
# Calculate correlation matrix
cor_mat <- cor(movies[, c("audience_score", "imdb_rating", "critics_score", "runtime", "imdb_num_votes")], use = "complete.obs")
# Convert matrix to long format
cor_mat_melt <- melt(cor_mat)
# Plot heatmap
ggplot(cor_mat_melt, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
coord_fixed() +
labs(title = "Correlation Heatmap")
Interpretation: Correlation Heatmap
The heatmap shows the strength of correlation between movie variables (audience_score, imdb_rating, critics_score, runtime, imdb_num_votes). Color indicates the direction and strength of each correlation: red for positive, blue for negative, and white for values near zero.
Key observations, by variable pair:
audience_score and imdb_rating
imdb_rating and critics_score
audience_score and critics_score
runtime and imdb_num_votes
imdb_num_votes and the three scores (audience_score, imdb_rating, critics_score)
runtime and the scores (audience_score, imdb_rating, critics_score), indicating duration has less influence on scores.
Summary:
The heatmap visually confirms strong positive relationships between
audience score, IMDb rating, and critics score. Vote count also
correlates positively with these scores, while runtime shows weaker
associations.
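A heatmap is easy to scan but hard to quote from. As a complement, the sketch below lists each variable pair once, sorted by absolute correlation. It uses simulated toy data, and the column names (`score_a`, `score_b`, `length`) and the object `cor_pairs` are hypothetical.

```r
set.seed(42)
n <- 300
quality <- rnorm(n)                       # hypothetical latent factor
toy <- data.frame(
  score_a = quality + rnorm(n, sd = 0.4),
  score_b = quality + rnorm(n, sd = 0.6),
  length  = rnorm(n)                      # unrelated to the scores by construction
)

cor_mat <- cor(toy, use = "complete.obs")

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```

The same steps should apply to a correlation matrix built from real columns, since only `cor_mat` is used after it is computed.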
# Create boxplots to compare audience_score based on categorical variable best_actor_win
ggplot(movies, aes(x = factor(best_actor_win), y = audience_score, fill = factor(best_actor_win))) +
geom_boxplot() +
scale_fill_manual(values = c("pink", "lightgreen")) +
labs(title = "Audience Score by Best Actor Win",
x = "Best Actor Win",
y = "Audience Score") +
theme_minimal() +
theme(legend.position = "none")
# Create boxplots to compare audience_score based on categorical variable best_actress_win
ggplot(movies, aes(x = factor(best_actress_win), y = audience_score, fill = factor(best_actress_win))) +
geom_boxplot() +
scale_fill_manual(values = c("pink", "lightgreen")) +
labs(title = "Audience Score by Best Actress Win",
x = "Best Actress Win",
y = "Audience Score") +
theme_minimal() +
theme(legend.position = "none")
# Create boxplots for audience_score across different genres
library(ggplot2)
ggplot(movies, aes(x = genre, y = audience_score)) +
geom_boxplot(fill = "lightblue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # rotate x-axis labels for readability
labs(title = "Audience Score by Genre", x = "Genre", y = "Audience Score")
Interpretation: Audience Score by Genre
We’ll fit a multiple linear regression model to predict the audience_score using several predictors.
# Fit a linear regression model
model <- lm(audience_score ~ imdb_rating + critics_score + runtime + genre + mpaa_rating, data = movies)
summary(model)
##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + runtime +
## genre + mpaa_rating, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.016 -6.379 0.511 5.472 50.451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.59073 4.21562 -8.205 1.29e-15 ***
## imdb_rating 15.02730 0.58902 25.512 < 2e-16 ***
## critics_score 0.06320 0.02178 2.902 0.00384 **
## runtime -0.04209 0.02229 -1.888 0.05948 .
## genreAnimation 8.49679 3.83345 2.216 0.02701 *
## genreArt House & International -0.22976 2.97398 -0.077 0.93844
## genreComedy 1.93987 1.63586 1.186 0.23613
## genreDocumentary 0.25499 2.25255 0.113 0.90991
## genreDrama 0.12498 1.41760 0.088 0.92977
## genreHorror -5.38764 2.45102 -2.198 0.02830 *
## genreMusical & Performing Arts 4.47095 3.16247 1.414 0.15793
## genreMystery & Suspense -5.86328 1.82390 -3.215 0.00137 **
## genreOther 1.59775 2.77788 0.575 0.56538
## genreScience Fiction & Fantasy -0.36507 3.50848 -0.104 0.91716
## mpaa_ratingNC-17 -3.75396 7.44514 -0.504 0.61429
## mpaa_ratingPG 1.12300 2.71223 0.414 0.67898
## mpaa_ratingPG-13 -0.08796 2.79094 -0.032 0.97487
## mpaa_ratingR 0.18145 2.68803 0.068 0.94620
## mpaa_ratingUnrated 0.99141 3.07018 0.323 0.74686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.822 on 632 degrees of freedom
## Multiple R-squared: 0.7706, Adjusted R-squared: 0.7641
## F-statistic: 118 on 18 and 632 DF, p-value: < 2.2e-16
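The summary above tests each factor level separately against its baseline. To ask whether a whole factor (for example, all mpaa_rating levels together) improves the fit, a partial F-test between nested models is one common option. The sketch below shows the mechanics on simulated data; `toy`, `x`, `group`, `reduced`, and `full` are hypothetical names.

```r
set.seed(7)
n <- 300
toy <- data.frame(
  x     = rnorm(n),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- 2 + 3 * toy$x + rnorm(n)   # y does not depend on group by construction

reduced <- lm(y ~ x, data = toy)
full    <- lm(y ~ x + group, data = toy)

# Partial F-test: does adding all `group` levels jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row would suggest the factor adds little once the other predictors are in the model.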
# Plot residuals to check assumptions visually
par(mfrow = c(2, 2)) # 2x2 plot layout for diagnostic plots
plot(model)
# Predict Audience Score using the model (optional)
movies$predicted_audience_score <- predict(model, newdata = movies)
# Plot actual vs predicted Audience Score
ggplot(movies, aes(x = predicted_audience_score, y = audience_score)) +
geom_point(alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs Predicted Audience Score",
x = "Predicted Audience Score",
y = "Actual Audience Score") +
theme_minimal()
# Fit the model
model <- lm(audience_score ~ imdb_rating + critics_score + runtime + genre, data = movies)
# Extract model summary with CI and significance
model_coef <- tidy(model, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>% # Remove intercept
mutate(significant = ifelse(p.value < 0.05, "Significant", "Not Significant"))
# Plot with horizontal error bars and color by significance
ggplot(model_coef, aes(x = estimate, y = reorder(term, estimate), color = significant)) +
geom_point(size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed") +
scale_color_manual(values = c("Significant" = "red", "Not Significant" = "black")) +
labs(
title = "Linear Regression Coefficients Predicting Audience Score (95% CI)",
x = "Estimate",
y = "Predictor",
color = "Significance"
) +
theme_minimal()
model2 <- lm(imdb_rating ~ audience_score + critics_score + runtime + genre, data = movies)
summary(model2)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + runtime +
## genre, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.34404 -0.20172 0.03557 0.27061 1.17348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.167317 0.125054 25.328 < 2e-16 ***
## audience_score 0.033883 0.001320 25.672 < 2e-16 ***
## critics_score 0.010407 0.000937 11.106 < 2e-16 ***
## runtime 0.005298 0.001017 5.207 2.6e-07 ***
## genreAnimation -0.367862 0.166787 -2.206 0.02777 *
## genreArt House & International 0.199889 0.137566 1.453 0.14670
## genreComedy -0.140961 0.076620 -1.840 0.06627 .
## genreDocumentary 0.266504 0.093978 2.836 0.00472 **
## genreDrama 0.057439 0.065519 0.877 0.38099
## genreHorror 0.095299 0.114098 0.835 0.40390
## genreMusical & Performing Arts 0.015918 0.149086 0.107 0.91501
## genreMystery & Suspense 0.261303 0.084593 3.089 0.00210 **
## genreOther -0.059825 0.131085 -0.456 0.64827
## genreScience Fiction & Fantasy -0.191440 0.165917 -1.154 0.24900
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4654 on 637 degrees of freedom
## Multiple R-squared: 0.8196, Adjusted R-squared: 0.8159
## F-statistic: 222.6 on 13 and 637 DF, p-value: < 2.2e-16
model2 <- lm(imdb_rating ~ audience_score + critics_score + runtime + genre, data = movies)
model2_coef <- tidy(model2, conf.int = TRUE)
model2_coef$significant <- ifelse(model2_coef$p.value < 0.05, "Significant", "Not Significant")
ggplot(model2_coef[-1, ], aes(x = estimate, y = term, color = significant)) + # Remove intercept
geom_point(size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed") +
scale_color_manual(values = c("Significant" = "red", "Not Significant" = "black")) +
labs(title = "Linear Regression Coefficients predicting IMDb Rating (95% CI)",
x = "Estimate",
y = "Predictor",
color = "Significance") +
theme_minimal()
# Prepare new data
new_movies <- data.frame(
title = c("Zootopia", "Deadpool", "La La Land", "Before the Flood", "The Witch"),
genre = factor(c("Animation",
"Action & Adventure",
"Musical & Performing Arts",
"Documentary",
"Horror"),
levels = levels(movies$genre)),
runtime = c(108, 108, 128, 96, 92),
mpaa_rating = factor(c("PG", "R", "PG-13", "PG", "R"),
levels = levels(movies$mpaa_rating)),
imdb_rating = c(8.0, 8.0, 8.0, 8.2, 6.9),
critics_score = c(98, 85, 91, 75, 90),
critics_source = c("Rotten Tomatoes",
"Metacritic",
"IMDb Metascore",
"Geo National",
"Rotten Tomatoes"),
audience_score = c(92, 96, 81, 85, 55)
)
print(new_movies)
## title genre runtime mpaa_rating imdb_rating
## 1 Zootopia Animation 108 PG 8.0
## 2 Deadpool Action & Adventure 108 R 8.0
## 3 La La Land Musical & Performing Arts 128 PG-13 8.0
## 4 Before the Flood Documentary 96 PG 8.2
## 5 The Witch Horror 92 R 6.9
## critics_score critics_source audience_score
## 1 98 Rotten Tomatoes 92
## 2 85 Metacritic 96
## 3 91 IMDb Metascore 81
## 4 75 Geo National 85
## 5 90 Rotten Tomatoes 55
# Prepare data for prediction by selecting the predictor columns
# (`model` was refit above without mpaa_rating, so predict() ignores that column)
new_movies_predict <- new_movies[, c("genre", "runtime", "mpaa_rating", "imdb_rating", "critics_score")]
# We exclude columns like 'title' and 'critics_source' because they are not predictors in the model
# Use the model to predict audience scores for new movies
# Also calculate 95% prediction intervals to quantify uncertainty around predictions
predictions <- predict(model, newdata = new_movies_predict, interval = "prediction", level = 0.95)
# Add predicted scores and intervals back to the data frame
new_movies$predicted_audience_score <- predictions[, "fit"] # predicted values
new_movies$pred_lower <- predictions[, "lwr"] # lower bound of 95% prediction interval
new_movies$pred_upper <- predictions[, "upr"] # upper bound of 95% prediction interval
# Show movie title, actual score, predicted score, and prediction interval
print(new_movies[, c("title",
  "audience_score",
  "predicted_audience_score",
  "pred_lower",
  "pred_upper")])
## title audience_score predicted_audience_score pred_lower
## 1 Zootopia 92 96.18022 75.82879
## 2 Deadpool 96 86.89502 67.44380
## 3 La La Land 81 90.99426 70.96088
## 4 Before the Flood 85 90.32078 70.87461
## 5 The Witch 55 65.92491 46.22797
## pred_upper
## 1 116.53166
## 2 106.34625
## 3 111.02764
## 4 109.76696
## 5 85.62185
Based on the model predictions, Zootopia is expected to have an audience score of about 96.18, with a 95% prediction interval from 75.83 to 116.53. Deadpool is predicted to score around 86.90, La La Land around 90.99, and Before the Flood around 90.32, all with similarly wide intervals. Among the five movies, The Witch has the lowest predicted score at 65.92, with an interval from 46.23 to 85.62. All five observed audience scores (92, 96, 81, 85, and 55) fall inside their intervals, but each interval spans roughly 40 points, and four of the upper bounds exceed the maximum possible score of 100 because the linear model does not constrain predictions to the 0 to 100 scale. These results show that the point predictions carry substantial uncertainty, which the prediction intervals make explicit.
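Four of the five upper bounds in the prediction output exceed 100, which an audience score cannot reach. One simple reporting option, sketched below, is to clip the bounds to the 0 to 100 scale. The `clip` helper and the `*_clipped` column names are hypothetical; the numbers are copied from the output above, and clipping changes only the presentation, not the model.

```r
# Prediction-interval values copied from the output above
pred <- data.frame(
  title = c("Zootopia", "Deadpool", "La La Land", "Before the Flood", "The Witch"),
  fit   = c(96.18022, 86.89502, 90.99426, 90.32078, 65.92491),
  lwr   = c(75.82879, 67.44380, 70.96088, 70.87461, 46.22797),
  upr   = c(116.53166, 106.34625, 111.02764, 109.76696, 85.62185)
)

# Clip to the 0-100 scale for reporting only; the fitted model is unchanged
clip <- function(x, lo = 0, hi = 100) pmin(pmax(x, lo), hi)
pred$lwr_clipped <- clip(pred$lwr)
pred$upr_clipped <- clip(pred$upr)
print(pred[, c("title", "fit", "lwr_clipped", "upr_clipped")])
```

A model that respects the bounded scale by design (for example, one fitted on a transformed response) would be a more principled alternative, at the cost of a less direct interpretation.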
Based on the linear regression analysis, IMDb rating is the strongest and most significant predictor of audience score (about 15 points per IMDb rating point, holding the other predictors fixed), followed by critics score. Runtime shows a small negative association that is only marginally significant (p ≈ 0.06). Relative to the baseline genre (Action & Adventure), Animation is associated with significantly higher audience scores, while Horror and Mystery & Suspense are associated with lower ones. No MPAA rating level is significant after accounting for the other variables. Because the data are observational, these are associations rather than causal effects.
The model performs fairly well, with an Adjusted R² of 76%, indicating that a substantial portion of the variation in audience scores can be explained by the predictors in the model. However, there is evidence of heteroskedasticity and the presence of influential outliers, which may affect the accuracy and reliability of coefficient estimates. As a next step, applying data transformations or using robust regression methods could improve the model’s validity.
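Heteroskedasticity can be checked formally as well as visually. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R, on simulated data whose error spread grows with the predictor. All object names here are hypothetical; the `bptest` function in the lmtest package provides a packaged version of this test.

```r
set.seed(3)
n <- 400
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x by construction
fit <- lm(y ~ x)

# Studentized (Koenker) Breusch-Pagan test, computed by hand:
# regress squared residuals on the predictors; n * R^2 is approximately
# chi-squared with df = number of predictors under homoskedasticity
aux <- lm(resid(fit)^2 ~ x)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p_value = p_value)
```

A small p-value is evidence against constant error variance; in that case, heteroskedasticity-robust standard errors or a transformed response are common follow-ups.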
Overall, this model offers valuable insights into the factors that shape audience ratings, especially regarding IMDb scores and film genres.