data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
library(lindia)
library(ggplot2)
Last week I created a linear regression model between revenue and vote count to explore whether financial success leads to more audience engagement.
This week, I am expanding the model by adding two new variables: popularity and an interaction term (budget * runtime) to see if they improve the model’s explanatory power. The idea is that popularity reflects audience interest and visibility, while budget and runtime together may indicate production scale, both of which could influence vote count.
Using multiple linear regression, I will:
Check that all variables have reasonably linear relationships.
Expand my model to include revenue, popularity, and budget * runtime.
Interpret each coefficient to see how each variable influences vote_count.
This will help determine whether a combination of variables explains the variation in vote count better than revenue alone.
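In R's formula syntax, `budget * runtime` is shorthand for both main effects plus their interaction, so no separate product column has to be built. A minimal check using base R's `terms()` (the object names `f` and `term_labels` are only illustrative):

```r
# Formula for the expanded model
f <- vote_count ~ revenue + popularity + budget * runtime

# terms() expands the `*` operator into main effects plus the interaction
term_labels <- attr(terms(f), "term.labels")
print(term_labels)
```

The expanded labels are the same rows that appear in the coefficient table of `summary()`, apart from the intercept.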
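Here I prepped the data by converting all the model variables to numeric. The interaction term is not stored as a separate column; it is specified as budget * runtime in the model formula.
Next I removed rows with missing (NA) values using na.omit(), which drops a row if any of its columns is NA, and kept only rows where revenue, vote_count, budget, and runtime are greater than zero.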
data$revenue <- as.numeric(data$revenue)
data$vote_count <- as.numeric(data$vote_count)
data$popularity <- as.numeric(data$popularity)
## Warning: NAs introduced by coercion
data$budget <- as.numeric(data$budget)
## Warning: NAs introduced by coercion
data$runtime <- as.numeric(data$runtime)
data <- na.omit(data)
data <- data[data$revenue > 0 & data$vote_count > 0 & data$budget > 0 & data$runtime > 0, ]
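The coercion warnings above come from non-numeric strings being turned into NA. Because `na.omit()` on a whole data frame also drops rows that are missing values in columns the model never uses, one alternative is to check completeness on the model columns only. The sketch below uses a small made-up data frame (`toy`, its values, and the `model_vars` vector are hypothetical), not the movie data:

```r
# Toy data, invented for illustration; columns arrive as character strings
toy <- data.frame(
  title      = c("A", "B", "C", "D"),
  revenue    = c("100", "200", "0", "400"),
  vote_count = c("10", "20", "30", "40"),
  popularity = c("1.5", "bad", "3.2", "4.1"),
  budget     = c("50", "60", "70", "80"),
  runtime    = c("90", "100", "110", NA),
  stringsAsFactors = FALSE
)

model_vars <- c("revenue", "vote_count", "popularity", "budget", "runtime")

# Convert the model columns to numeric; non-numeric strings become NA
toy[model_vars] <- lapply(
  toy[model_vars],
  function(x) suppressWarnings(as.numeric(x))
)

# Keep rows that are complete in the model columns only
toy_clean <- toy[complete.cases(toy[model_vars]), ]

# Same positivity filter as in the analysis
toy_clean <- toy_clean[toy_clean$revenue > 0 & toy_clean$vote_count > 0 &
                         toy_clean$budget > 0 & toy_clean$runtime > 0, ]
print(toy_clean)
```

Applied to the real data, this approach could keep more rows than `na.omit(data)`, so the fitted coefficients would not necessarily match the summary reported here.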
model <- lm(vote_count ~ revenue + popularity + budget * runtime, data = data)
summary(model)
##
## Call:
## lm(formula = vote_count ~ revenue + popularity + budget * runtime,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9428.5 -258.1 -125.0 55.6 8934.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.342e+01 6.675e+01 -0.351 0.7257
## revenue 4.771e-06 9.959e-08 47.906 < 2e-16 ***
## popularity 1.566e+01 8.451e-01 18.528 < 2e-16 ***
## budget -6.478e-06 1.370e-06 -4.730 2.31e-06 ***
## runtime 1.142e+00 5.944e-01 1.922 0.0547 .
## budget:runtime 6.915e-08 1.132e-08 6.111 1.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 768.3 on 5356 degrees of freedom
## Multiple R-squared: 0.6243, Adjusted R-squared: 0.624
## F-statistic: 1780 on 5 and 5356 DF, p-value: < 2.2e-16
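Most predictors are statistically significant, as shown by their very low p-values: revenue, popularity, budget, and the interaction term all have p < 0.001. The main effect of runtime is only marginally significant (p = 0.0547). Low p-values mean these associations with vote_count are unlikely to be due to chance alone; they do not by themselves measure how large each effect is.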
R-squared increased from 0.5935 in last week's revenue-only model to 0.6243 in the new model (adjusted R-squared 0.624), so the added predictors explain about three more percentage points of the variance in vote_count.
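Plain R-squared never decreases when predictors are added to a nested model, so adjusted R-squared and a partial F-test are useful companions when comparing the two models. The sketch below runs that comparison on simulated data; the data frame `sim`, the coefficients used to generate it, and the names `small` and `big` are all assumptions for illustration and have nothing to do with the movie data.

```r
set.seed(1)
n <- 200

# Simulated stand-in data (not the movie data)
sim <- data.frame(
  revenue    = runif(n, 1, 100),
  popularity = runif(n, 0, 20),
  budget     = runif(n, 1, 50),
  runtime    = runif(n, 80, 160)
)
sim$vote_count <- 5 * sim$revenue + 10 * sim$popularity +
  0.02 * sim$budget * sim$runtime + rnorm(n, sd = 50)

# Nested models: revenue only vs. the expanded specification
small <- lm(vote_count ~ revenue, data = sim)
big   <- lm(vote_count ~ revenue + popularity + budget * runtime, data = sim)

# Adjusted R-squared penalises the extra predictors
adj_r2 <- c(small = summary(small)$adj.r.squared,
            big   = summary(big)$adj.r.squared)
print(adj_r2)

# Partial F-test: do the added terms jointly improve the fit?
f_test <- anova(small, big)
print(f_test)
```

On the real data, the same two calls would need last week's model and this week's model fitted to the same rows.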
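The budget * runtime interaction is highly significant (p = 1.06e-09) and has a positive coefficient. The coefficient is tiny in absolute terms (6.915e-08) only because the product of budget and runtime is a very large number, so its size reflects the units rather than the strength of the effect. The positive sign means the relationship between budget and vote_count becomes more positive as runtime increases. This suggests the interaction term adds explanatory value beyond the main effects.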
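With an interaction in the model, the effect of budget is no longer a single number: it equals the budget coefficient plus the interaction coefficient times runtime. The sketch below plugs in the two estimates copied from the summary output; `budget_slope` is a hypothetical helper, and the runtimes are arbitrary example values.

```r
# Estimates copied from the summary() output
b_budget      <- -6.478e-06   # main effect of budget
b_interaction <-  6.915e-08   # budget:runtime

# Change in predicted vote_count per extra dollar of budget at a given runtime
budget_slope <- function(runtime) b_budget + b_interaction * runtime

runtimes <- c(60, 90, 120, 150)
slopes   <- budget_slope(runtimes)

# Rescale to votes per extra $1 million of budget for readability
print(data.frame(runtime = runtimes, votes_per_million = slopes * 1e6))

# Runtime at which the budget slope changes sign
crossover <- -b_budget / b_interaction
print(crossover)
```

The crossover is roughly 94 minutes: below that runtime the fitted budget slope is negative, and above it the slope is positive. This is a property of the fitted linear model and should be read with the diagnostic caveats in mind.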
This plot checks for linearity and constant variance of residuals.
gg_resfitted(model) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs. Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
These plots compare the residuals against each predictor, checking for linearity and homoscedasticity.
plots <- gg_resX(model, plot.all = FALSE)
plots$revenue + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
plots$popularity + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_smooth()`).
# Manually calculate residuals
data$residuals <- resid(model)
# Plot residuals vs. interaction term
ggplot(data, aes(x = budget * runtime, y = residuals)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Residuals vs. Budget * Runtime",
       x = "Budget * Runtime", y = "Residuals") +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The smoothed line curves in every one of these residual plots, which indicates non-linearity in each predictor.
The budget * runtime plot shows the least curvature, suggesting only mild non-linearity for the interaction term.
This plot checks whether the residuals are normally distributed, an assumption that the hypothesis tests on the coefficients rely on.
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This plot compares the residual distribution to a theoretical normal distribution, checking for normality.
gg_qqplot(model)
The points depart noticeably from the diagonal line at both ends, so the residuals are not normally distributed in the tails. This is consistent with the extreme residuals in the model summary (about -9,400 to 8,900 against a residual standard error of 768).
The middle of the distribution stays close to the diagonal line, so the normality assumption holds approximately for the bulk of the residuals.
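One way to put a number on the tail behaviour seen in a Q-Q plot is to compare the share of standardized residuals beyond ±3 with the share a normal distribution would produce. The sketch below does this on simulated data with deliberately heavy-tailed errors; the variables `x`, `y`, and `fit` are made up for illustration and are not the movie model.

```r
set.seed(42)
n <- 2000

# Simulated regression with heavy-tailed (t-distributed) errors
x   <- runif(n)
y   <- 2 * x + rt(n, df = 3)
fit <- lm(y ~ x)

# Standardized residuals put the tails on a common scale
z <- rstandard(fit)

# Share of residuals beyond +/- 3 vs. what a normal distribution implies
observed_share <- mean(abs(z) > 3)
expected_share <- 2 * pnorm(-3)
print(c(observed = observed_share, expected = expected_share))

# The coordinates a Q-Q plot draws: theoretical (x) vs. sample (y) quantiles
qq <- qqnorm(z, plot.it = FALSE)
print(head(cbind(theoretical = sort(qq$x), sample = sort(qq$y))))
```

Replacing `fit` with the fitted movie model would give the equivalent figures for this analysis.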
This plot identifies observations that strongly influence regression results.
gg_cooksd(model, threshold = "matlab")
The plot shows that most data points have little influence on the model.
A few observations, most noticeably 4793 and 5258, stand out as highly influential. They could be outliers or have unusual characteristics that affect the regression results.
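The flagged observations can also be listed directly with base R's `cooks.distance()`. The sketch below assumes that the "matlab" threshold corresponds to three times the mean Cook's distance, which is how I understand the lindia documentation; treat that cutoff as an assumption. The data are simulated, with one off-trend point planted on purpose.

```r
set.seed(7)
n <- 100

# Simulated data with one planted high-leverage, off-trend observation
x <- rnorm(n)
y <- 3 * x + rnorm(n)
x[n] <- 8
y[n] <- -10

fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance
cutoff  <- 3 * mean(cd)
flagged <- which(cd > cutoff)
print(flagged)

# The three most influential observations, largest first
print(sort(cd, decreasing = TRUE)[1:3])
```

Running the same steps on the movie model would show the rows behind labels such as 4793 and 5258, which makes it easier to judge whether they are data errors or genuinely unusual films.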
In conclusion, the multiple linear regression model shows that revenue, popularity, and the budget * runtime interaction are all significantly associated with vote_count. Although the model mildly violates some assumptions (the residual plots show curvature and the Q-Q plot shows non-normal tails), it shows a decent fit with an adjusted R-squared of 0.62. It also explains more of the variation in vote_count than revenue alone.