data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Poisson Regression Modeling: Predicting Runtime from Budget, Revenue and Popularity

Approach:

  • For this analysis, I’m building a generalized linear model with Poisson regression to predict a movie’s runtime based on its budget, revenue and popularity. I’ve always heard that movies with bigger budgets or more popular films tend to be longer, possibly due to higher production value or more elaborate storytelling. By including these variables, I want to see whether there is a connection between them and a movie’s runtime.

  • Using this model, I will:

    • Fit a Poisson regression model using runtime

    • Diagnose the model using the 5 diagnostic plots

    • Interpret one of the coefficients

    • Highlight any assumptions or modeling issues

Data Preparation:

  • Before building the model, I convert all relevant columns to numeric, I then remove any invalid or zero values to ensure clean input.
data$runtime <- as.numeric(data$runtime)
data$budget <- as.numeric(data$budget)
## Warning: NAs introduced by coercion
data$revenue <- as.numeric(data$revenue)
data$popularity <- as.numeric(data$popularity)
## Warning: NAs introduced by coercion
data <- na.omit(data)
data <- data[data$runtime > 0 & data$budget > 0 & data$revenue > 0 & data$popularity > 0, ]

Building Our Model:

  • Here I fit the model with runtime as our response variable. The Poisson family with a log link is valid because of runtime is always a positive integer.
model <- glm(runtime ~ budget + revenue + popularity, 
             data = data, family = poisson(link = "log"))

summary(model)
## 
## Call:
## glm(formula = runtime ~ budget + revenue + popularity, family = poisson(link = "log"), 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) 4.673e+00  1.773e-03 2636.218   <2e-16 ***
## budget      5.361e-10  4.574e-11   11.721   <2e-16 ***
## revenue     1.101e-10  1.147e-11    9.597   <2e-16 ***
## popularity  7.871e-05  9.831e-05    0.801    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 21034  on 5368  degrees of freedom
## Residual deviance: 20132  on 5365  degrees of freedom
## AIC: 55164
## 
## Number of Fisher Scoring iterations: 4
  • Budget and Revenue both have very low p-values (< 2e-16) , indicating that they are meaningful predictors of a movie’s runtime.

  • Popularity has a high p-value of 0.423, indicating that it is not statistically significant. This suggests popularity does not have a strong effect on runtime when controlling for budget and revenue.

  • The AIC is 55164, which helps compare models—lower is better—but on its own doesn’t say much.

  • The residual deviance is only a bit lower than the null deviance, so the model doesn’t explain much more than just using the average.

5 Diagnostic Plots

  • Now we will diagnose our model using the 5 diagnostic plots.

Residuals vs. Fitted Value

plot(model, which = 1)

  • The plot shows a slight funnel shape with some heteroscedasticity. The model also struggles to predict lower runtimes consistently. It also shows a few outliers with large residuals.

Residuals vs. Each Predictor

# Create residuals column
data$resid <- residuals(model, type = "pearson")

# Residuals vs Budget
ggplot(data, aes(x = budget, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Budget")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

# Residuals vs Revenue
ggplot(data, aes(x = revenue, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Revenue")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

# Residuals vs Popularity
ggplot(data, aes(x = popularity, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Popularity")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

  • Residuals vs. Budget shows a slight upward curve indicating that it hints at a weak non-linear relationship. Residuals vs. Revenue shows a fairly even spread indicating that it is a good fit. Residuals vs. Popularity shows a clear curve with non-linearity which it may need transformation to fix.

Histogram of Residuals

hist(residuals(model, type = "pearson"),
     breaks = 30,
     main = "Histogram of Pearson Residuals",
     xlab = "Pearson Residuals",
     col = "lightgreen",
     border = "white")

  • The distribution is slightly right-skewed, meaning most residuals are near zero, but there are a few high positive residuals. It suggests that the model somtimes underpredicts runtime for a subset of movies.

Normal QQ Plot

plot(model, which = 2)

  • The residuals deviate from the diagonal line in the upper tail, suggesting that the residuals are not perfectly normally distributed.

Cook’s Distance

plot(model, which = 4)

  • Most observations have low influence, but one point (observation 30701) is highly influential and should be checked. It could be an outlier or data entry error that’s affecting the model’s coefficients.

Interpreting A Coefficient

After diagnosing the model we now move to interpreting our coefficient, I chose to interpret revenue.

coef_revenue <- coef(model)["revenue"]
exp(coef_revenue)
## revenue 
##       1
  • A one-unit increase in revenue has no measurable effect on predicted runtime. It shows that revenue doesn't meaningfully influence runtime in this model.

Conclusion

  • The Poisson model shows that budget has some effect on runtime, while revenue and popularity don’t. However, the diagnostics suggestsed that the model isn’t a perfect fit indicating that other factors likely influence runtime more.