data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Logistic Regression Model: Predicting Adult-Rated Films

Approach:

  • For this assignment, I build a logistic regression model to predict whether a movie is labeled “adult”. The reasoning is that features such as a movie’s popularity or budget might be associated with whether it is made for mature audiences.

  • Using logistic regression, I will:

    • Convert the adult column to binary (1 for adult, 0 otherwise)

    • Use a few explanatory variables to build a logistic model

    • Interpret the model’s coefficients

    • Construct a confidence interval for one of the variables

Data Preparation

  • Here I prepare the data by converting the adult column to binary: 1 if the value is “True”, 0 otherwise.

  • I chose popularity, budget, and vote_count as predictors, since each of these could plausibly differ between adult and non-adult films.

  • I also remove rows with missing (NA) values and keep only rows with a positive budget and non-negative vote_count and popularity.

data$adult <- ifelse(data$adult == "True", 1, 0)

data$popularity <- as.numeric(data$popularity)
## Warning: NAs introduced by coercion
data$budget <- as.numeric(data$budget)
## Warning: NAs introduced by coercion
data$vote_count <- as.numeric(data$vote_count)

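# na.omit() drops every row with an NA in any column, not only the model variables
data <- na.omit(data)
# Keep rows with a positive budget and non-negative vote_count and popularity
data <- data[data$budget > 0 & data$vote_count >= 0 & data$popularity >= 0, ]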
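
  • The cleaning steps above can be tried on a small example to see which rows survive and how balanced the response is afterwards. This is only a sketch: the toy data frame and its values are made up for illustration and are not taken from movies_metadata.csv.

```r
# Hypothetical toy data standing in for movies_metadata.csv (values are made up).
toy <- data.frame(
  adult      = c("False", "True", "False", "False", "False"),
  popularity = c("21.9", "0.5", "not_a_number", "3.2", "7.1"),
  budget     = c("30000000", "500000", "1000000", "0", "65000000"),
  vote_count = c(5415, 12, 40, 3, 2413),
  stringsAsFactors = FALSE
)

# Same steps as in the report: binary response, numeric predictors, drop bad rows.
toy$adult      <- ifelse(toy$adult == "True", 1, 0)
toy$popularity <- suppressWarnings(as.numeric(toy$popularity))  # bad strings become NA
toy$budget     <- as.numeric(toy$budget)
toy <- na.omit(toy)                                             # drops the NA row
toy <- toy[toy$budget > 0 & toy$vote_count >= 0 & toy$popularity >= 0, ]

# Class balance after cleaning: how many rows are left in each class?
class_counts <- table(factor(toy$adult, levels = c(0, 1)))
print(class_counts)
print(nrow(toy))
```

  • On the real data, running table(data$adult) after the filter would show how many adult-labeled films remain, which matters a great deal for how well the model can be estimated.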

Building Our Model

  • Here we fit the logistic regression model with glm(). The binomial family matches the binary response (adult), and the logit link means the model estimates the log-odds of a movie being adult-rated as a linear function of the predictors.
model <- glm(adult ~ popularity + budget + vote_count, data = data, family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
## 
## Call:
## glm(formula = adult ~ popularity + budget + vote_count, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.929e+00  1.394e+00  -4.972 6.63e-07 ***
## popularity  -3.558e-01  9.259e-01  -0.384    0.701    
## budget      -4.836e-07  9.142e-07  -0.529    0.597    
## vote_count  -5.880e-04  2.905e-02  -0.020    0.984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20.183  on 8879  degrees of freedom
## Residual deviance: 17.410  on 8876  degrees of freedom
## AIC: 25.41
## 
## Number of Fisher Scoring iterations: 19
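
  • Written out, the model fitted above has the following form. The symbols p and β0 to β3 are notation introduced here for illustration; they correspond to the four rows of the coefficient table.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\text{popularity} + \beta_2\,\text{budget} + \beta_3\,\text{vote\_count},
\qquad
p = \Pr(\text{adult} = 1)

% Solving for p gives the predicted probability:
p = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1\,\text{popularity} + \beta_2\,\text{budget} + \beta_3\,\text{vote\_count})\bigr)}
```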

Insights:

  • The intercept of -6.93 is on the log-odds scale. When all predictors are 0, the odds of a movie being adult-rated are about exp(-6.93) ≈ 0.001, which is extremely low. This is expected, as adult films are very rare in this data.

  • None of the predictors is statistically significant: all of their p-values are well above 0.05. We therefore don’t have enough evidence that any of them is associated with an adult label.

  • We also got the warning “fitted probabilities numerically 0 or 1 occurred”, meaning the model produced extremely confident predictions for some data points. Together with the 19 Fisher scoring iterations and the very small null deviance (20.18 on 8,879 degrees of freedom), this suggests that very few adult films remain after cleaning, so the estimates and standard errors should be treated with caution.
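
  • The intercept and the rarity of adult films can be checked with a little arithmetic on the numbers printed by summary(model). The sketch below copies those numbers by hand and assumes the response is a plain 0/1 vector, so that the null deviance depends only on the number of adult films k and the number of rows n. The helper null_deviance() is written here for illustration and is not part of any package.

```r
# Values copied from the summary(model) output in the report.
intercept <- -6.929
n_obs     <- 8879 + 1      # null degrees of freedom + 1
null_dev  <- 20.183

# Log-odds -> odds -> probability for the baseline (all predictors at 0).
baseline_odds <- exp(intercept)
baseline_prob <- plogis(intercept)   # same as odds / (1 + odds)

# Null deviance of an intercept-only model with k positives out of n rows
# (illustrative helper, valid for a 0/1 response).
null_deviance <- function(k, n) {
  p <- k / n
  -2 * (k * log(p) + (n - k) * log(1 - p))
}

# Which small count of adult films best reproduces the reported null deviance?
candidates <- 1:5
implied    <- sapply(candidates, null_deviance, n = n_obs)
best_k     <- candidates[which.min(abs(implied - null_dev))]

print(signif(c(odds = baseline_odds, prob = baseline_prob), 3))
print(round(implied, 2))
print(best_k)
```

  • If this reading is right, the reported null deviance is what one would get with a single adult film among the 8,880 rows, which would explain both the warning and the very large standard errors.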

Constructing a Confidence Interval

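  • Here we construct a 95% confidence interval for the popularity coefficient. The interval gives a range of plausible values for the effect of popularity on the log-odds of an adult rating. If the interval doesn’t include 0, the effect is statistically significant at the 5% level; if it includes 0, we cannot rule out that there is no effect.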
coef_summary <- summary(model)$coefficients

est <- coef_summary["popularity", "Estimate"]
se <- coef_summary["popularity", "Std. Error"]

lower <- est - 1.96 * se
upper <- est + 1.96 * se

c(lower, upper)
## [1] -2.170447  1.458940
  • The confidence interval runs from -2.17 to 1.46 on the log-odds scale. Because it includes 0, we do not have enough evidence that popularity is a meaningful predictor of whether a movie is adult-rated.

  • We can’t confidently say whether higher popularity increases or decreases the odds of a film being adult-rated.
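
  • The same interval can be expressed as an odds ratio, which is often easier to read. The sketch below copies the popularity estimate and standard error from the coefficient table and uses qnorm(0.975) rather than the rounded 1.96, so the endpoints may differ from the ones above in the fourth decimal place.

```r
# Estimate and standard error for popularity, copied from the summary output.
est <- -3.558e-01
se  <-  9.259e-01

# Wald interval on the log-odds scale, using the exact normal quantile
# instead of the rounded 1.96.
z      <- qnorm(0.975)
ci_log <- c(lower = est - z * se, upper = est + z * se)

# Exponentiate to get the interval for the odds ratio per one-unit
# increase in popularity. An odds ratio of 1 means "no effect".
ci_or <- exp(ci_log)

print(round(ci_log, 3))
print(round(ci_or, 3))

# TRUE when the odds-ratio interval contains 1, i.e. no clear direction of effect.
print(ci_or[["lower"]] < 1 && ci_or[["upper"]] > 1)
```

  • On the fitted object, confint.default(model, "popularity") should give the same Wald interval. confint(model) uses profile likelihood instead, which may differ noticeably when events are this rare.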

Conclusion

In conclusion, the logistic regression model gives no evidence that any of the chosen predictors is significantly associated with whether a movie is adult-rated. All p-values are well above 0.05, and the 95% confidence interval for popularity includes zero.