data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
For this analysis, I’m building a generalized linear model with Poisson regression to predict a movie’s runtime based on its budget, revenue and popularity. I’ve always heard that movies with bigger budgets or more popular films tend to be longer, possibly due to higher production value or more elaborate storytelling. By including these variables, I want to see whether there is a connection between them and a movie’s runtime.
Using this model, I will:
Fit a Poisson regression model using runtime
Diagnose the model using the 5 diagnostic plots
Interpret one of the coefficients
Highlight any assumptions or modeling issues
data$runtime <- as.numeric(data$runtime)
data$budget <- as.numeric(data$budget)
## Warning: NAs introduced by coercion
data$revenue <- as.numeric(data$revenue)
data$popularity <- as.numeric(data$popularity)
## Warning: NAs introduced by coercion
data <- na.omit(data)
data <- data[data$runtime > 0 & data$budget > 0 & data$revenue > 0 & data$popularity > 0, ]
model <- glm(runtime ~ budget + revenue + popularity,
data = data, family = poisson(link = "log"))
summary(model)
##
## Call:
## glm(formula = runtime ~ budget + revenue + popularity, family = poisson(link = "log"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.673e+00 1.773e-03 2636.218 <2e-16 ***
## budget 5.361e-10 4.574e-11 11.721 <2e-16 ***
## revenue 1.101e-10 1.147e-11 9.597 <2e-16 ***
## popularity 7.871e-05 9.831e-05 0.801 0.423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 21034 on 5368 degrees of freedom
## Residual deviance: 20132 on 5365 degrees of freedom
## AIC: 55164
##
## Number of Fisher Scoring iterations: 4
Budget and Revenue both have very low p-values (< 2e-16) , indicating that they are meaningful predictors of a movie’s runtime.
Popularity has a high p-value of 0.423, indicating that it is not statistically significant. This suggests popularity does not have a strong effect on runtime when controlling for budget and revenue.
The AIC is 55164, which helps compare models—lower is better—but on its own doesn’t say much.
The residual deviance is only a bit lower than the null deviance, so the model doesn’t explain much more than just using the average.
plot(model, which = 1)
# Create residuals column
data$resid <- residuals(model, type = "pearson")
# Residuals vs Budget
ggplot(data, aes(x = budget, y = resid)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE) +
labs(title = "Residuals vs Budget")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Residuals vs Revenue
ggplot(data, aes(x = revenue, y = resid)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE) +
labs(title = "Residuals vs Revenue")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Residuals vs Popularity
ggplot(data, aes(x = popularity, y = resid)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE) +
labs(title = "Residuals vs Popularity")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
hist(residuals(model, type = "pearson"),
breaks = 30,
main = "Histogram of Pearson Residuals",
xlab = "Pearson Residuals",
col = "lightgreen",
border = "white")
plot(model, which = 2)
plot(model, which = 4)
After diagnosing the model we now move to interpreting our coefficient, I chose to interpret revenue.
coef_revenue <- coef(model)["revenue"]
exp(coef_revenue)
## revenue
## 1