This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelr)
movies <- read.csv("C:/Users/DELL/Downloads/imdb.csv")
str(movies)
## 'data.frame': 377 obs. of 12 variables:
## $ country : chr "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" "Mummies" ...
## $ date_x : chr "03-02-2023" "12/15/2022 " "04-05-2023" "01-05-2023" ...
## $ score : int 73 78 76 70 61 66 80 83 59 58 ...
## $ genre : chr "Drama,?\xffAction" "Science Fiction,?\xffAdventure,?\xffAction" "Animation,?\xffAdventure,?\xffFamily,?\xffFantasy,?\xffComedy" "Animation,?\xffComedy,?\xffFamily,?\xffAdventure,?\xffFantasy" ...
## $ overview : chr "After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend a "Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), "While working underground to fix a water main, Brooklyn plumbers\x83??and brothers\x83??Mario and Luigi are transported down a "Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey ...
## $ crew : chr "Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duk "Sam Worthington, Jake Sully, Zoe Salda?\xf1a, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Mile "Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Ke "??scar Barber?\xadn, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis P??rez Reina, Carnaby (voice), Mar??a Luisa Sol?\xad, ...
## $ orig_title: chr "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" " Momias" ...
## $ status : chr " Released" " Released" " Released" " Released" ...
## $ orig_lang : chr " English" " English" " English" " Spanish, Castilian" ...
## $ budget_x : num 7.50e+07 4.60e+08 1.00e+08 1.23e+07 7.70e+07 ...
## $ revenue : num 2.72e+08 2.32e+09 7.24e+08 3.42e+07 3.41e+08 ...
## $ X : chr "AU" "AU" "AU" "AU" ...
summary(movies)
## country date_x score genre
## Length:377 Length:377 Min. : 0.00 Length:377
## Class :character Class :character 1st Qu.: 63.00 Class :character
## Mode :character Mode :character Median : 70.00 Mode :character
## Mean : 68.26
## 3rd Qu.: 76.00
## Max. :100.00
## overview crew orig_title status
## Length:377 Length:377 Length:377 Length:377
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## orig_lang budget_x revenue X
## Length:377 Min. : 105 Min. :0.000e+00 Length:377
## Class :character 1st Qu.: 38211149 1st Qu.:1.020e+08 Class :character
## Mode :character Median : 92600000 Median :3.526e+08 Mode :character
## Mean : 96816280 Mean :4.080e+08
## 3rd Qu.:136400000 3rd Qu.:5.645e+08
## Max. :460000000 Max. :2.924e+09
movies <- movies %>%
mutate(high_rating = as.numeric(score >= 75))
table(movies$high_rating)
##
## 0 1
## 263 114
Logistic Regression Model
Let’s build a logistic regression model to predict high rating based on budget and genre. Budget is continuous, while genre is a categorical variable that we’ll need to dummy code.
movies <- movies %>%
mutate(action = as.numeric(str_detect(genre, "Action")),
drama = as.numeric(str_detect(genre, "Drama")),
comedy = as.numeric(str_detect(genre, "Comedy")))
m1 <- glm(high_rating ~ budget_x + action + drama + comedy, data = movies, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_rating ~ budget_x + action + drama + comedy,
## family = "binomial", data = movies)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.475e+00 2.669e-01 -5.526 3.27e-08 ***
## budget_x 3.356e-09 1.593e-09 2.108 0.0351 *
## action -1.196e-01 2.528e-01 -0.473 0.6362
## drama 6.731e-01 2.782e-01 2.419 0.0155 *
## comedy 5.788e-01 2.598e-01 2.228 0.0259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.11 on 376 degrees of freedom
## Residual deviance: 449.39 on 372 degrees of freedom
## AIC: 459.39
##
## Number of Fisher Scoring iterations: 4
Transforming Variables
The budget variable ranges from very small to hundreds of millions. We can try a log transformation to improve the fit.
m2 <- glm(high_rating ~ log(budget_x) + action + drama + comedy,
data = movies, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_rating ~ log(budget_x) + action + drama +
## comedy, family = "binomial", data = movies)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.28059 1.36635 -1.669 0.0951 .
## log(budget_x) 0.06471 0.07606 0.851 0.3949
## action -0.07590 0.24990 -0.304 0.7613
## drama 0.60896 0.27424 2.221 0.0264 *
## comedy 0.52688 0.25875 2.036 0.0417 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.11 on 376 degrees of freedom
## Residual deviance: 453.01 on 372 degrees of freedom
## AIC: 463.01
##
## Number of Fisher Scoring iterations: 4
scatter plots would be helpful here to visualize the relationship between budget and high ratings before and after the log transformation.
ggplot(movies, aes(x=budget_x, y=high_rating)) +
geom_point(alpha=0.1)
This shows the relationship between raw budget and high rating. There is a positive correlation, but the large positive budgets on the right side have high leverage.
Now with log budget:
movies <- movies %>%
mutate(log_budget = log(budget_x))
ggplot(movies, aes(x=log_budget, y=high_rating)) +
geom_point(alpha=0.1)
The log transformation compressed the scale for very large budgets. The
trend is still positive but more linear throughout the range. This helps
reduce the influence of those high leverage points.
The log transformation led to a better fitting model overall. The scatter plots illustrate why - the relationship is more linear with log budget compared to raw budget. The transformation improved the fit and interpretation.