R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelr)
movies <- read.csv("C:/Users/DELL/Downloads/imdb.csv")
str(movies)
## 'data.frame':    377 obs. of  12 variables:
##  $ country   : chr  "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" "Mummies" ...
##  $ date_x    : chr  "03-02-2023" "12/15/2022 " "04-05-2023" "01-05-2023" ...
##  $ score     : int  73 78 76 70 61 66 80 83 59 58 ...
##  $ genre     : chr  "Drama,?\xffAction" "Science Fiction,?\xffAdventure,?\xffAction" "Animation,?\xffAdventure,?\xffFamily,?\xffFantasy,?\xffComedy" "Animation,?\xffComedy,?\xffFamily,?\xffAdventure,?\xffFantasy" ...
##  $ overview  : chr  "After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend a "Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), "While working underground to fix a water main, Brooklyn plumbers\x83??and brothers\x83??Mario and Luigi are transported down a  "Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey  ...
##  $ crew      : chr  "Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duk "Sam Worthington, Jake Sully, Zoe Salda?\xf1a, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Mile "Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Ke "??scar Barber?\xadn, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis P??rez Reina, Carnaby (voice), Mar??a Luisa Sol?\xad, ...
##  $ orig_title: chr  "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" " Momias" ...
##  $ status    : chr  " Released" " Released" " Released" " Released" ...
##  $ orig_lang : chr  " English" " English" " English" " Spanish, Castilian" ...
##  $ budget_x  : num  7.50e+07 4.60e+08 1.00e+08 1.23e+07 7.70e+07 ...
##  $ revenue   : num  2.72e+08 2.32e+09 7.24e+08 3.42e+07 3.41e+08 ...
##  $ X         : chr  "AU" "AU" "AU" "AU" ...
summary(movies)
##    country             date_x              score           genre          
##  Length:377         Length:377         Min.   :  0.00   Length:377        
##  Class :character   Class :character   1st Qu.: 63.00   Class :character  
##  Mode  :character   Mode  :character   Median : 70.00   Mode  :character  
##                                        Mean   : 68.26                     
##                                        3rd Qu.: 76.00                     
##                                        Max.   :100.00                     
##    overview             crew            orig_title           status         
##  Length:377         Length:377         Length:377         Length:377        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   orig_lang            budget_x            revenue               X            
##  Length:377         Min.   :      105   Min.   :0.000e+00   Length:377        
##  Class :character   1st Qu.: 38211149   1st Qu.:1.020e+08   Class :character  
##  Mode  :character   Median : 92600000   Median :3.526e+08   Mode  :character  
##                     Mean   : 96816280   Mean   :4.080e+08                     
##                     3rd Qu.:136400000   3rd Qu.:5.645e+08                     
##                     Max.   :460000000   Max.   :2.924e+09
movies <- movies %>%
  mutate(high_rating = as.numeric(score >= 75))
table(movies$high_rating)
## 
##   0   1 
## 263 114

Logistic Regression Model

Let’s build a logistic regression model to predict high rating based on budget and genre. Budget is continuous, while genre is a categorical variable that we’ll need to dummy code.

movies <- movies %>%
  mutate(action = as.numeric(str_detect(genre, "Action")),
         drama = as.numeric(str_detect(genre, "Drama")),
         comedy = as.numeric(str_detect(genre, "Comedy")))

m1 <- glm(high_rating ~ budget_x + action + drama + comedy, data = movies, family = "binomial")

summary(m1)
## 
## Call:
## glm(formula = high_rating ~ budget_x + action + drama + comedy, 
##     family = "binomial", data = movies)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.475e+00  2.669e-01  -5.526 3.27e-08 ***
## budget_x     3.356e-09  1.593e-09   2.108   0.0351 *  
## action      -1.196e-01  2.528e-01  -0.473   0.6362    
## drama        6.731e-01  2.782e-01   2.419   0.0155 *  
## comedy       5.788e-01  2.598e-01   2.228   0.0259 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.11  on 376  degrees of freedom
## Residual deviance: 449.39  on 372  degrees of freedom
## AIC: 459.39
## 
## Number of Fisher Scoring iterations: 4

Transforming Variables

The budget variable ranges from very small to hundreds of millions. We can try a log transformation to improve the fit.

m2 <- glm(high_rating ~ log(budget_x) + action + drama + comedy, 
          data = movies, family = "binomial")

summary(m2)
## 
## Call:
## glm(formula = high_rating ~ log(budget_x) + action + drama + 
##     comedy, family = "binomial", data = movies)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -2.28059    1.36635  -1.669   0.0951 .
## log(budget_x)  0.06471    0.07606   0.851   0.3949  
## action        -0.07590    0.24990  -0.304   0.7613  
## drama          0.60896    0.27424   2.221   0.0264 *
## comedy         0.52688    0.25875   2.036   0.0417 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.11  on 376  degrees of freedom
## Residual deviance: 453.01  on 372  degrees of freedom
## AIC: 463.01
## 
## Number of Fisher Scoring iterations: 4

scatter plots would be helpful here to visualize the relationship between budget and high ratings before and after the log transformation.

ggplot(movies, aes(x=budget_x, y=high_rating)) + 
  geom_point(alpha=0.1)

This shows the relationship between raw budget and high rating. There is a positive correlation, but the large positive budgets on the right side have high leverage.

Now with log budget:

movies <- movies %>% 
  mutate(log_budget = log(budget_x))

ggplot(movies, aes(x=log_budget, y=high_rating)) +
  geom_point(alpha=0.1)

The log transformation compressed the scale for very large budgets. The trend is still positive but more linear throughout the range. This helps reduce the influence of those high leverage points.

The log transformation led to a better fitting model overall. The scatter plots illustrate why - the relationship is more linear with log budget compared to raw budget. The transformation improved the fit and interpretation.