data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
library(ggplot2)
For this assignment, I’m building a logistic regression model to predict whether a movie is labeled as “adult” or not. The logic behind this is that certain features like a movie’s popularity or budget might be linked to whether a movie is made for mature audiences.
Using logistic regression, I will:
Convert the adult column to binary (1 for adult, 0 otherwise)
Use a few explanatory variables to build a logistic model
Interpret the model’s coefficients
Construct a confidence interval for one of the variables
Here I prepare my data by converting the adult column to binary: 1 if it’s true and 0 if it’s false.
I chose popularity, budget, and vote_count as predictor variables because each of them could plausibly differ between adult and non-adult movies.
I also remove rows with NA values and keep only movies with a positive budget and non-negative popularity and vote counts.
data$adult <- ifelse(data$adult == "True", 1, 0)
data$popularity <- as.numeric(data$popularity)
## Warning: NAs introduced by coercion
data$budget <- as.numeric(data$budget)
## Warning: NAs introduced by coercion
data$vote_count <- as.numeric(data$vote_count)
data <- na.omit(data)
data <- data[data$budget > 0 & data$vote_count >= 0 & data$popularity >= 0, ]
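Before fitting, it can help to check how many adult and non-adult movies survive the cleaning steps, since a very small positive class makes a logistic model unstable. The sketch below repeats the same steps on a few invented rows (the `toy` data frame and its values are hypothetical stand-ins for movies_metadata.csv). It also restricts `na.omit()` to the model columns, which is an assumption about what is wanted: calling `na.omit()` on the full data frame drops a row if any column is missing, including columns the model never uses.

```r
# Hypothetical toy rows standing in for movies_metadata.csv (values invented).
toy <- data.frame(
  adult      = c("False", "False", "True", "False", "False"),
  popularity = c("21.9", "0.5", "1.2", "not a number", "3.4"),
  budget     = c("30000000", "0", "15000", "1000", "250000"),
  vote_count = c(5415, 2, 10, NA, 40),
  stringsAsFactors = FALSE
)

# Same conversions as in the report; unparseable strings become NA.
toy$adult      <- ifelse(toy$adult == "True", 1, 0)
toy$popularity <- suppressWarnings(as.numeric(toy$popularity))
toy$budget     <- suppressWarnings(as.numeric(toy$budget))

# Drop incomplete rows, looking only at the columns the model uses.
model_cols <- c("adult", "popularity", "budget", "vote_count")
toy <- na.omit(toy[, model_cols])

# Same range filter as in the report.
toy <- toy[toy$budget > 0 & toy$vote_count >= 0 & toy$popularity >= 0, ]

# Count each class; factor() keeps a zero count visible if a class vanishes.
class_counts <- table(factor(toy$adult, levels = c(0, 1)))
print(class_counts)
```

Running `table(data$adult)` on the real data after the filter would show the actual class balance.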
model <- glm(adult ~ popularity + budget + vote_count, data = data, family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = adult ~ popularity + budget + vote_count, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.929e+00 1.394e+00 -4.972 6.63e-07 ***
## popularity -3.558e-01 9.259e-01 -0.384 0.701
## budget -4.836e-07 9.142e-07 -0.529 0.597
## vote_count -5.880e-04 2.905e-02 -0.020 0.984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20.183 on 8879 degrees of freedom
## Residual deviance: 17.410 on 8876 degrees of freedom
## AIC: 25.41
##
## Number of Fisher Scoring iterations: 19
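The deviance lines in the summary above also hint at how rare the positive class is. For an intercept-only logistic model, the null deviance depends only on the number of events and the number of rows, so it can be recomputed for a few candidate event counts. The helper `null_deviance()` below is a hypothetical name introduced for this sketch. The result is an inference from the printed output, not a check against the data itself.

```r
# Null deviance of an intercept-only logistic model with k events among n rows.
null_deviance <- function(k, n) {
  p <- k / n
  -2 * (k * log(p) + (n - k) * log(1 - p))
}

# The summary reports 8879 null degrees of freedom, so n = 8879 + 1.
n_obs <- 8879 + 1

# Compare candidate event counts with the reported null deviance of 20.183.
candidates <- data.frame(events = 1:5)
candidates$null_deviance <- null_deviance(candidates$events, n_obs)
print(candidates)
```

Only a single event reproduces the reported value, which is consistent with just one adult-labeled movie remaining after filtering. That would also explain the large standard errors, the 19 Fisher scoring iterations, and the warning printed when the model was fit.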
Insights:
The intercept of -6.93 is the log-odds of a movie being adult-rated when all predictors are 0, which corresponds to odds of roughly 0.001. Such a low baseline is expected because adult films are very rare in this dataset.
None of the predictors is statistically significant, as all their p-values (0.701, 0.597, and 0.984) are greater than 0.05. This means we don’t have enough evidence that any of them is associated with an adult rating.
We also got the warning “fitted probabilities numerically 0 or 1 occurred”, meaning that the model produced fitted probabilities indistinguishable from 0 or 1 for some data points. This often happens when the outcome is very rare or almost perfectly separated by the predictors, so the estimates and standard errors should be read with caution.
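The coefficients discussed above are on the log-odds scale, which is hard to read directly. A minimal sketch of converting them to odds ratios and a baseline probability is shown below; the estimates are copied by hand from the printed summary, and the variable names are introduced only for this example.

```r
# Estimates copied from the summary output (log-odds scale).
log_odds <- c(
  "(Intercept)" = -6.929,
  popularity    = -0.3558,
  budget        = -4.836e-07,
  vote_count    = -5.880e-04
)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase.
odds_ratios <- exp(log_odds)

# plogis() is the inverse logit: the predicted probability when all predictors are 0.
baseline_prob <- plogis(log_odds[["(Intercept)"]])

print(signif(odds_ratios, 3))
print(signif(baseline_prob, 3))
```

With the fitted model object available, `exp(coef(model))` gives the same odds ratios without retyping the numbers. An odds ratio below 1 corresponds to a negative coefficient, but given the p-values above, none of these can be distinguished from 1.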
coef_summary <- summary(model)$coefficients
est <- coef_summary["popularity", "Estimate"]
se <- coef_summary["popularity", "Std. Error"]
lower <- est - 1.96 * se
upper <- est + 1.96 * se
c(lower, upper)
## [1] -2.170447 1.458940
The 95% confidence interval for the popularity coefficient ranges from -2.17 to 1.46 on the log-odds scale. Because it includes 0, we do not have enough evidence that popularity is a meaningful predictor of whether a movie is adult rated.
We can’t confidently say whether popularity increases or decreases the odds of a film being adult rated.
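The interval above is a Wald interval: estimate plus or minus a normal quantile times the standard error. The sketch below redoes it with `qnorm(0.975)` in place of the rounded 1.96 and exponentiates the bounds to get an odds-ratio interval. The estimate and standard error are copied from the printed summary, so the bounds differ from the output above in the later decimals. The second half fits a small simulated model (hypothetical data, unrelated to the movies) to show that `confint.default()` returns the same Wald interval as the manual calculation.

```r
# Reported values for popularity, copied from the summary output.
est <- -0.3558
se  <- 0.9259

z <- qnorm(0.975)  # normal quantile for a 95% interval, about 1.96
ci_log_odds   <- c(lower = est - z * se, upper = est + z * se)
ci_odds_ratio <- exp(ci_log_odds)  # same interval on the odds-ratio scale

print(round(ci_log_odds, 2))
print(round(ci_odds_ratio, 2))

# Cross-check on simulated data (toy example, not the movie data).
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(-0.5 + 0.8 * toy$x))
toy_fit <- glm(y ~ x, data = toy, family = binomial)

coefs   <- summary(toy_fit)$coefficients
manual  <- coefs["x", "Estimate"] + c(-1, 1) * z * coefs["x", "Std. Error"]
builtin <- confint.default(toy_fit, parm = "x", level = 0.95)

print(manual)
print(builtin)
```

An interval on the log-odds scale that contains 0 always maps to an odds-ratio interval that contains 1, so both scales lead to the same conclusion for popularity.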
In conclusion, the logistic regression model provides no evidence that any of the chosen predictors significantly affects whether a movie is adult rated. All three p-values are well above 0.05, and the 95% confidence interval for popularity includes zero.