R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelr)
movies <- read.csv("C:/Users/Prasad/Downloads/imdb.csv")
str(movies)
## 'data.frame':    503 obs. of  12 variables:
##  $ names     : chr  "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" "Mummies" ...
##  $ date_x    : chr  "03-02-2023" "12/15/2022 " "04-05-2023" "01-05-2023" ...
##  $ score     : int  73 78 76 70 61 66 80 83 59 58 ...
##  $ genre     : chr  "Drama,?\xffAction" "Science Fiction,?\xffAdventure,?\xffAction" "Animation,?\xffAdventure,?\xffFamily,?\xffFantasy,?\xffComedy" "Animation,?\xffComedy,?\xffFamily,?\xffAdventure,?\xffFantasy" ...
##  $ overview  : chr  "After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend a "Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), "While working underground to fix a water main, Brooklyn plumbers\x83??and brothers\x83??Mario and Luigi are transported down a  "Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey  ...
##  $ crew      : chr  "Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duk "Sam Worthington, Jake Sully, Zoe Salda?\xf1a, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Mile "Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Ke "??scar Barber?\xadn, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis P??rez Reina, Carnaby (voice), Mar??a Luisa Sol?\xad, ...
##  $ orig_title: chr  "Creed III" "Avatar: The Way of Water" "The Super Mario Bros. Movie" " Momias" ...
##  $ status    : chr  " Released" " Released" " Released" " Released" ...
##  $ orig_lang : chr  " English" " English" " English" " Spanish, Castilian" ...
##  $ budget_x  : num  7.50e+07 4.60e+08 1.00e+08 1.23e+07 7.70e+07 ...
##  $ revenue   : num  2.72e+08 2.32e+09 7.24e+08 3.42e+07 3.41e+08 ...
##  $ country   : chr  "AU" "AU" "AU" "AU" ...
summary(movies)
##     names              date_x              score          genre          
##  Length:503         Length:503         Min.   :  0.0   Length:503        
##  Class :character   Class :character   1st Qu.: 64.0   Class :character  
##  Mode  :character   Mode  :character   Median : 70.0   Mode  :character  
##                                        Mean   : 68.3                     
##                                        3rd Qu.: 76.0                     
##                                        Max.   :100.0                     
##                                        NA's   :64                        
##    overview             crew            orig_title           status         
##  Length:503         Length:503         Length:503         Length:503        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   orig_lang            budget_x            revenue            country         
##  Length:503         Min.   :      105   Min.   :0.000e+00   Length:503        
##  Class :character   1st Qu.: 35511113   1st Qu.:9.562e+07   Class :character  
##  Mode  :character   Median : 97200000   Median :3.441e+08   Mode  :character  
##                     Mean   : 96848059   Mean   :3.998e+08                     
##                     3rd Qu.:136700000   3rd Qu.:5.671e+08                     
##                     Max.   :460000000   Max.   :2.924e+09                     
##                     NA's   :64          NA's   :64
movies <- movies %>%
  mutate(high_rating = as.numeric(score >= 75))
table(movies$high_rating)
## 
##   0   1 
## 304 135

Logistic Regression Model

Let’s build a logistic regression model to predict high rating based on budget and genre. Budget is continuous, while genre is a categorical variable that we’ll need to dummy code.

movies <- movies %>%
  mutate(action = as.numeric(str_detect(genre, "Action")),
         drama = as.numeric(str_detect(genre, "Drama")),
         comedy = as.numeric(str_detect(genre, "Comedy")))

m1 <- glm(high_rating ~ budget_x + action + drama + comedy, data = movies, family = "binomial")

summary(m1)
## 
## Call:
## glm(formula = high_rating ~ budget_x + action + drama + comedy, 
##     family = "binomial", data = movies)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.375e+00  2.431e-01  -5.656 1.55e-08 ***
## budget_x     2.276e-09  1.485e-09   1.533  0.12535    
## action       4.329e-02  2.316e-01   0.187  0.85169    
## drama        6.747e-01  2.535e-01   2.661  0.00779 ** 
## comedy       4.886e-01  2.383e-01   2.050  0.04035 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 541.81  on 438  degrees of freedom
## Residual deviance: 531.02  on 434  degrees of freedom
##   (64 observations deleted due to missingness)
## AIC: 541.02
## 
## Number of Fisher Scoring iterations: 4

Transforming Variables

The budget variable ranges from very small to hundreds of millions. We can try a log transformation to improve the fit.

m2 <- glm(high_rating ~ log(budget_x) + action + drama + comedy, 
          data = movies, family = "binomial")

summary(m2)
## 
## Call:
## glm(formula = high_rating ~ log(budget_x) + action + drama + 
##     comedy, family = "binomial", data = movies)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -1.61430    1.21864  -1.325   0.1853  
## log(budget_x)  0.02620    0.06811   0.385   0.7005  
## action         0.07792    0.23009   0.339   0.7349  
## drama          0.62963    0.25069   2.512   0.0120 *
## comedy         0.47333    0.23915   1.979   0.0478 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 541.81  on 438  degrees of freedom
## Residual deviance: 533.19  on 434  degrees of freedom
##   (64 observations deleted due to missingness)
## AIC: 543.19
## 
## Number of Fisher Scoring iterations: 4

scatter plots would be helpful here to visualize the relationship between budget and high ratings before and after the log transformation.

ggplot(movies, aes(x=budget_x, y=high_rating)) + 
  geom_point(alpha=0.1)
## Warning: Removed 64 rows containing missing values (`geom_point()`).

This shows the relationship between raw budget and high rating. There is a positive correlation, but the large positive budgets on the right side have high leverage.

Now with log budget:

movies <- movies %>% 
  mutate(log_budget = log(budget_x))

ggplot(movies, aes(x=log_budget, y=high_rating)) +
  geom_point(alpha=0.1)
## Warning: Removed 64 rows containing missing values (`geom_point()`).

The log transformation compressed the scale for very large budgets. The trend is still positive but more linear throughout the range. This helps reduce the influence of those high leverage points.

The log transformation led to a better fitting model overall. The scatter plots illustrate why - the relationship is more linear with log budget compared to raw budget. The transformation improved the fit and interpretation.