data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(lindia)

library(ggplot2)

Linear Regression Between Vote_Count and Other Variables

Approach:

Last week I created a linear regression model between revenue and vote count to explore whether financial success leads to more audience engagement.
This week, I am expanding the model by adding two new variables: popularity and an interaction term (budget * runtime) to see if they improve the model’s explanatory power. The idea is that popularity reflects audience interest and visibility, while budget and runtime together may indicate production scale, both of which could influence vote count.
Using multiple linear regression, I will:
- Check that all variables have reasonably linear relationships
- Expand on my model with revenue, popularity, and budget*runtime
- Interpret each coefficient to see how each variable influences vote_count
This will help determine whether a combination of variables can better explain variations in audience voting rather than using revenue alone.

Prepping the Data

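Here I prepped the data by converting the model variables to numeric. The interaction term (budget*runtime) is not stored as a separate column; it is specified directly in the model formula in the next section.
Next I removed rows with missing (NA) values, then kept only the rows where revenue, vote_count, budget, and runtime are all greater than zero.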

data$revenue <- as.numeric(data$revenue)
data$vote_count <- as.numeric(data$vote_count)
data$popularity <- as.numeric(data$popularity)

## Warning: NAs introduced by coercion

data$budget <- as.numeric(data$budget)

## Warning: NAs introduced by coercion

data$runtime <- as.numeric(data$runtime)

data <- na.omit(data)
data <- data[data$revenue > 0 & data$vote_count > 0 & data$budget > 0 & data$runtime > 0, ]
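The "NAs introduced by coercion" warnings above come from entries that cannot be parsed as numbers. Below is a minimal sketch, using a made-up vector rather than the real columns, of how one might count those entries before they are dropped by na.omit(). The vector `raw` and its values are assumptions for illustration only.

```r
# Hypothetical raw column: two valid numbers, one stray text value, one empty string
raw <- c("1000", "250.5", "not a number", "")

# as.numeric() turns anything it cannot parse into NA
converted <- suppressWarnings(as.numeric(raw))

# Total entries that become NA after conversion
n_missing <- sum(is.na(converted))

# Of those, entries that held non-empty text which was not numeric
n_unparsable <- sum(is.na(converted) & nzchar(raw))

n_missing
n_unparsable
```

Running the same two counts on a real column before calling na.omit() would show how many rows are lost to bad text versus empty fields.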

Building the Model

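Here we fit the multiple linear regression model, adding popularity and the budget * runtime interaction alongside revenue from the previous model. In an R formula, budget * runtime includes both main effects as well as their interaction.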

model <- lm(vote_count ~ revenue + popularity + budget * runtime, data = data)

summary(model)

## 
## Call:
## lm(formula = vote_count ~ revenue + popularity + budget * runtime, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9428.5  -258.1  -125.0    55.6  8934.2 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.342e+01  6.675e+01  -0.351   0.7257    
## revenue         4.771e-06  9.959e-08  47.906  < 2e-16 ***
## popularity      1.566e+01  8.451e-01  18.528  < 2e-16 ***
## budget         -6.478e-06  1.370e-06  -4.730 2.31e-06 ***
## runtime         1.142e+00  5.944e-01   1.922   0.0547 .  
## budget:runtime  6.915e-08  1.132e-08   6.111 1.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 768.3 on 5356 degrees of freedom
## Multiple R-squared:  0.6243, Adjusted R-squared:  0.624 
## F-statistic:  1780 on 5 and 5356 DF,  p-value: < 2.2e-16
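The coefficient table lists budget, runtime, and budget:runtime separately even though the formula only contains budget * runtime. A small sketch of why: the `*` operator in a formula is shorthand for both main effects plus the interaction, which can be checked with terms() without fitting anything.

```r
# Same formula as the model above; no data is needed to inspect its terms
f <- vote_count ~ revenue + popularity + budget * runtime

# term.labels shows how R expands the formula before fitting
term_labels <- attr(terms(f), "term.labels")
term_labels
```

Writing budget:runtime instead of budget * runtime would keep only the interaction and drop the two main effects.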

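Most predictors are statistically significant, as shown by their very low p-values: revenue, popularity, budget, and the interaction term are all significant at the 0.001 level, while the runtime main effect is only marginal (p = 0.0547). This means these variables are useful for predicting vote_count, although a low p-value by itself does not say how large an effect is.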
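The R-squared value increased from 0.5935 in the old model to 0.6243 in the new model, so the added predictors explain about three more percentage points of the variance in vote_count. The adjusted R-squared (0.624) is almost identical, which suggests the gain is not just a side effect of adding more terms.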
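Because R-squared can only go up when predictors are added, a comparison of nested models is a useful complement. The sketch below uses simulated data (the variables `x1`, `x2`, and `y` are made up, not the movie data) to show one way of comparing a smaller and a larger model with adjusted R-squared and an F-test.

```r
set.seed(42)

# Simulated data: y depends on both x1 and x2
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + 1.5 * x2 + rnorm(n)

small <- lm(y ~ x1)        # analogous to the revenue-only model
large <- lm(y ~ x1 + x2)   # analogous to the expanded model

# Adjusted R-squared penalizes extra terms
adj_r2 <- c(small = summary(small)$adj.r.squared,
            large = summary(large)$adj.r.squared)
adj_r2

# F-test for whether the extra term improves the fit
cmp <- anova(small, large)
cmp
```

Applied to this analysis, the same anova() call would need both the old and the new model fitted on the same filtered rows.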
The budget:runtime interaction is highly significant (p = 1.06e-09) and positive. Its coefficient (6.915e-08) looks tiny only because of the scale of the inputs: budget is recorded in raw currency amounts, so budget times runtime is a very large number. The positive sign means the slope of vote_count on budget increases as runtime increases. It also shows that interaction terms do matter in explaining vote_count.
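With an interaction in the model, the budget coefficient alone is not the effect of budget; the slope depends on runtime. A short worked sketch using the estimates from the summary above (the chosen runtimes are arbitrary illustration values):

```r
# Estimates copied from the model summary
b_budget <- -6.478e-06   # budget main effect
b_inter  <-  6.915e-08   # budget:runtime interaction

# Slope of vote_count with respect to budget at a given runtime
budget_slope <- function(runtime) b_budget + b_inter * runtime

# Expected change in vote_count per extra 10 million of budget
runtimes <- c(60, 90, 120, 150)
votes_per_10m <- budget_slope(runtimes) * 1e7
names(votes_per_10m) <- runtimes
round(votes_per_10m, 1)

# Runtime at which the budget slope changes sign
crossover <- -b_budget / b_inter
round(crossover, 1)
```

Under these estimates the budget slope is negative for short films and turns positive at a runtime of roughly 94, holding revenue and popularity fixed.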

Moving On: Diagnostic Plots

Now that we have fit and evaluated our regression model, let’s move on to creating and assessing diagnostic plots to see whether the model meets the assumptions of linear regression.

Residuals vs. Fitted Values

This plot checks for linearity and constant variance of residuals.

gg_resfitted(model) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs. Fitted Values")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The plot shows a downward curve, which violates the linearity assumption, and it also suggests non-constant variance.
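To show what a curved residual pattern means, here is a minimal sketch on simulated data (not the movie data): a straight line is fit to a relationship that is actually quadratic, and the residuals are averaged within three bands of fitted values. The variable names and the choice of three bands are assumptions for illustration.

```r
set.seed(1)

# Simulated data with a curved (quadratic) relationship
x <- runif(300, 0, 10)
y <- 0.5 * x^2 + rnorm(300)

# Deliberately misspecified straight-line fit
fit <- lm(y ~ x)

# Average residual in low, middle, and high bands of fitted values
band <- cut(fitted(fit), breaks = 3, labels = c("low", "mid", "high"))
band_means <- tapply(resid(fit), band, mean)
round(band_means, 2)
```

If the linearity assumption held, each band would average close to zero; here the residuals are positive at both ends and negative in the middle, which is the kind of systematic curve the smoother in the plot above picks up.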

Residuals vs. Each Predictor Variable

These plots compare the residuals with each predictor, checking for linearity and homoscedasticity.

plots <- gg_resX(model, plot.all = FALSE)
plots$revenue + geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$popularity + geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_smooth()`).

# Manually calculate residuals
data$residuals <- resid(model)

# Plot residuals vs. interaction term
ggplot(data, aes(x = budget * runtime, y = residuals)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Residuals vs. Budget * Runtime",
       x = "Budget * Runtime", y = "Residuals") +
  theme_minimal()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The smoothed line is curved in every plot, indicating non-linearity for each predictor.
Budget * Runtime shows the least curvature, suggesting only mild non-linearity.

Histogram of Residuals

This plot checks whether the residuals are normally distributed, which the hypothesis tests on the coefficients assume.

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram is mostly centered around 0, suggesting that the errors are approximately normally distributed. However, the skew suggests there may be influential points.

QQ Plot

This plot compares the residual distribution to a theoretical normal distribution, checking for normality.

gg_qqplot(model)

The points move noticeably away from the diagonal line at both ends, suggesting that the residuals are not normally distributed in the tails.
The middle of the plot stays close to the diagonal line, so the normality assumption holds reasonably well for the bulk of the residuals.
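The pattern of a good fit in the middle and departures at both ends is typical of heavy-tailed residuals. The sketch below builds a deterministic heavy-tailed sample (quantiles of a t distribution with 3 degrees of freedom, chosen only for illustration) and compares its largest standardized value with what a normal distribution would give for the same sample size.

```r
# Heavy-tailed stand-in for residuals: evenly spaced quantiles of a t(3) distribution
n <- 1000
heavy <- qt(ppoints(n), df = 3)

# Standardize so the values are on the same scale as normal quantiles
z <- as.numeric(scale(heavy))

# qqnorm() returns theoretical (x) and sample (y) quantiles without plotting
qq <- qqnorm(z, plot.it = FALSE)

# Largest sample quantile versus largest theoretical normal quantile
c(sample = max(qq$y), theoretical = max(qq$x))
```

The largest sample quantile sits well above the largest theoretical one, which is what pulls the ends of a QQ plot away from the diagonal.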

Cook’s Distance

This plot identifies observations that strongly influence regression results.

gg_cooksd(model, threshold = "matlab")

The plot shows that most data points have low influence on the model.
A few observations, most noticeably 4793 and 5258, stand out as highly influential. They could be outliers or have unique characteristics that affect the regression results.
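The same flagging can be done numerically with base R. The sketch below uses simulated data with one planted influential point, and it assumes that the "matlab" threshold corresponds to three times the mean Cook's distance, which is my reading of the lindia documentation rather than something verified here.

```r
set.seed(7)

# Simulated data with a clear linear relationship
x <- rnorm(100)
y <- 3 * x + rnorm(100)

# Plant one influential point: extreme x value far from the trend
x[100] <- 6
y[100] <- -10

fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance
cutoff  <- 3 * mean(cd)
flagged <- which(cd > cutoff)
flagged
```

For the movie model, the same which() call on cooks.distance(model) would list the flagged rows so they can be inspected in the data directly.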

Conclusion

In conclusion, the multiple linear regression model shows that revenue, popularity, and the budget-runtime interaction are all significantly associated with vote_count. Although the diagnostic plots show that the model does not fully satisfy the assumptions of linearity, constant variance, and normality, it achieves a decent fit with an adjusted R-squared of 0.62. It also offers a stronger explanation than using revenue alone.

Data Dive Week 9

2025-03-21