This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelr)
movies <- read.csv("C:/Users/Prasad/Downloads/imdb.csv")
str(movies)
## 'data.frame': 503 obs. of 12 variables:
## $ names : chr "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" "Mummies" ...
## $ date_x : chr "03-02-2023" "12/15/2022 " "04-05-2023" "01-05-2023" ...
## $ score : int 73 78 76 70 61 66 80 83 59 58 ...
## $ genre : chr "Drama,?\xffAction" "Science Fiction,?\xffAdventure,?\xffAction" "Animation,?\xffAdventure,?\xffFamily,?\xffFantasy,?\xffComedy" "Animation,?\xffComedy,?\xffFamily,?\xffAdventure,?\xffFantasy" ...
## $ overview : chr "After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend a "Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), "While working underground to fix a water main, Brooklyn plumbers\x83??and brothers\x83??Mario and Luigi are transported down a "Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey ...
## $ crew : chr "Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duk "Sam Worthington, Jake Sully, Zoe Salda?\xf1a, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Mile "Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Ke "??scar Barber?\xadn, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis P??rez Reina, Carnaby (voice), Mar??a Luisa Sol?\xad, ...
## $ orig_title: chr "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" " Momias" ...
## $ status : chr " Released" " Released" " Released" " Released" ...
## $ orig_lang : chr " English" " English" " English" " Spanish, Castilian" ...
## $ budget_x : num 7.50e+07 4.60e+08 1.00e+08 1.23e+07 7.70e+07 ...
## $ revenue : num 2.72e+08 2.32e+09 7.24e+08 3.42e+07 3.41e+08 ...
## $ country : chr "AU" "AU" "AU" "AU" ...
summary(movies)
## names date_x score genre
## Length:503 Length:503 Min. : 0.0 Length:503
## Class :character Class :character 1st Qu.: 64.0 Class :character
## Mode :character Mode :character Median : 70.0 Mode :character
## Mean : 68.3
## 3rd Qu.: 76.0
## Max. :100.0
## NA's :64
## overview crew orig_title status
## Length:503 Length:503 Length:503 Length:503
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## orig_lang budget_x revenue country
## Length:503 Min. : 105 Min. :0.000e+00 Length:503
## Class :character 1st Qu.: 35511113 1st Qu.:9.562e+07 Class :character
## Mode :character Median : 97200000 Median :3.441e+08 Mode :character
## Mean : 96848059 Mean :3.998e+08
## 3rd Qu.:136700000 3rd Qu.:5.671e+08
## Max. :460000000 Max. :2.924e+09
## NA's :64 NA's :64
movies <- movies %>%
mutate(high_rating = as.numeric(score >= 75))
table(movies$high_rating)
##
## 0 1
## 304 135
Logistic Regression Model
Let’s build a logistic regression model to predict high rating based on budget and genre. Budget is continuous, while genre is a categorical variable that we’ll need to dummy code.
movies <- movies %>%
mutate(action = as.numeric(str_detect(genre, "Action")),
drama = as.numeric(str_detect(genre, "Drama")),
comedy = as.numeric(str_detect(genre, "Comedy")))
m1 <- glm(high_rating ~ budget_x + action + drama + comedy, data = movies, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_rating ~ budget_x + action + drama + comedy,
## family = "binomial", data = movies)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.375e+00 2.431e-01 -5.656 1.55e-08 ***
## budget_x 2.276e-09 1.485e-09 1.533 0.12535
## action 4.329e-02 2.316e-01 0.187 0.85169
## drama 6.747e-01 2.535e-01 2.661 0.00779 **
## comedy 4.886e-01 2.383e-01 2.050 0.04035 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 541.81 on 438 degrees of freedom
## Residual deviance: 531.02 on 434 degrees of freedom
## (64 observations deleted due to missingness)
## AIC: 541.02
##
## Number of Fisher Scoring iterations: 4
Transforming Variables
The budget variable ranges from very small to hundreds of millions. We can try a log transformation to improve the fit.
m2 <- glm(high_rating ~ log(budget_x) + action + drama + comedy,
data = movies, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_rating ~ log(budget_x) + action + drama +
## comedy, family = "binomial", data = movies)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.61430 1.21864 -1.325 0.1853
## log(budget_x) 0.02620 0.06811 0.385 0.7005
## action 0.07792 0.23009 0.339 0.7349
## drama 0.62963 0.25069 2.512 0.0120 *
## comedy 0.47333 0.23915 1.979 0.0478 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 541.81 on 438 degrees of freedom
## Residual deviance: 533.19 on 434 degrees of freedom
## (64 observations deleted due to missingness)
## AIC: 543.19
##
## Number of Fisher Scoring iterations: 4
scatter plots would be helpful here to visualize the relationship between budget and high ratings before and after the log transformation.
ggplot(movies, aes(x=budget_x, y=high_rating)) +
geom_point(alpha=0.1)
## Warning: Removed 64 rows containing missing values (`geom_point()`).
This shows the relationship between raw budget and high rating. There is a positive correlation, but the large positive budgets on the right side have high leverage.
Now with log budget:
movies <- movies %>%
mutate(log_budget = log(budget_x))
ggplot(movies, aes(x=log_budget, y=high_rating)) +
geom_point(alpha=0.1)
## Warning: Removed 64 rows containing missing values (`geom_point()`).
The log transformation compressed the scale for very large budgets. The
trend is still positive but more linear throughout the range. This helps
reduce the influence of those high leverage points.
The log transformation led to a better fitting model overall. The scatter plots illustrate why - the relationship is more linear with log budget compared to raw budget. The transformation improved the fit and interpretation.