Setting Up R and Loading the Data Set

First we bring in all the libraries we will be using. Then we load the data set we have downloaded.

#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)
library(broom)
library(boot)
library(lindia)

#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")

#drop rows with a missing budget or gross
movies_raw <- movies_raw |>
  drop_na(budget, gross)

The next step for our data set is to clean it and format it so that we can begin to work through it.

#create a new table, splitting the released column into release date and country
movies_ <- movies_raw |>
  separate(released, into = c("release_new","country_released"), sep=" \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |>    #remove the closing parenthesis
  mutate(release_date=mdy(release_new)) |>         #parse the date string into a Date
  rename(country_filmed=country)            #rename column for ease of understanding
  
movies_
## # A tibble: 5,436 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 The Sh… R      Drama  1980 June 13, 1… United States      8.4 9.27e5 Stanley…
##  2 The Bl… R      Adve…  1980 July 2, 19… United States      5.8 6.5 e4 Randal …
##  3 Star W… PG     Acti…  1980 June 20, 1… United States      8.7 1.20e6 Irvin K…
##  4 Airpla… PG     Come…  1980 July 2, 19… United States      7.7 2.21e5 Jim Abr…
##  5 Caddys… R      Come…  1980 July 25, 1… United States      7.3 1.08e5 Harold …
##  6 Friday… R      Horr…  1980 May 9, 1980 United States      6.4 1.23e5 Sean S.…
##  7 The Bl… R      Acti…  1980 June 20, 1… United States      7.9 1.88e5 John La…
##  8 Raging… R      Biog…  1980 December 1… United States      8.2 3.30e5 Martin …
##  9 Superm… PG     Acti…  1980 June 19, 1… United States      6.8 1.01e5 Richard…
## 10 The Lo… R      Biog…  1980 May 16, 19… United States      7   1   e4 Walter …
## # ℹ 5,426 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>
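
To make the cleaning step easier to follow, here is a small sketch of the same split on a toy table. The two input strings are made up for illustration, written in the "Month day, year (Country)" format that the code above assumes; `toy` and `toy_clean` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Hypothetical rows in the same format as the released column
toy <- tibble(released = c("June 13, 1980 (United States)",
                           "July 2, 1980 (United States)"))

toy_clean <- toy |>
  # split at " (" so the date lands in one column and "Country)" in the other
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(
    country_released = str_remove(country_released, "\\)$"),  # drop the trailing ")"
    release_date = mdy(release_new)                           # parse text into a Date
  )

toy_clean
```

Note that `separate()` drops the original `released` column by default, which is why it no longer appears in the output.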

Selecting a Binary Variable for our Model

I am going to base the binary variable on gross for this data dive. I want to separate movies into successful and unsuccessful, and I will call a movie successful if its gross is above its budget.

#add binary column: 1 if gross exceeds budget, 0 otherwise
movies_ <- movies_ %>% 
  mutate(successful = ifelse(gross > budget, 1, 0))
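
Before modeling, it can help to check how the two classes are balanced. The sketch below does this on a toy table with made-up budget and gross values (not the movies data); it also shows that a movie that exactly breaks even is coded 0 under the `gross > budget` rule.

```r
library(dplyr)

# Hypothetical values; the last row breaks even exactly
toy <- tibble(budget = c(10, 20, 30, 40),
              gross  = c(15,  5, 90, 40))

toy <- toy |>
  mutate(successful = ifelse(gross > budget, 1, 0))

# Count and share of each class
balance <- toy |>
  count(successful) |>
  mutate(share = n / sum(n))

balance
```

The same `count()` call could be run on `movies_` to see the actual split.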

Creating the Logistic Regression Model

We need to create the model and choose explanatory variables for it. First we plot the success variable against each candidate.

#plot with budget
movies_ |>
  ggplot(mapping = aes(x = budget, y = successful)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()

#plot with runtime
movies_ |>
  ggplot(mapping = aes(x = runtime, y = successful)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()

#plot with score
movies_ |>
  ggplot(mapping = aes(x = score, y = successful)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()

The plots above show how each candidate explanatory variable relates to the success variable. I chose budget, runtime, and score because each of these can affect a movie's performance and people's enjoyment of it. That leads to more or fewer people going to see it and, in the end, to a higher or lower gross.
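
The plot titles mention OLS, and one reason to move on to logistic regression is that a straight line fitted to a 0/1 response can predict values outside the range of a probability. The sketch below illustrates this on simulated data only; the predictor `x`, the true slope of 2, and the sample size are all assumptions chosen for the illustration.

```r
set.seed(1)  # simulated data, not the movies data set

n <- 500
x <- runif(n, min = -4, max = 4)                 # hypothetical predictor
y <- rbinom(n, size = 1, prob = plogis(2 * x))   # binary response

toy <- data.frame(x = x, y = y)

ols_fit   <- lm(y ~ x, data = toy)
logit_fit <- glm(y ~ x, data = toy, family = binomial(link = "logit"))

# Compare predictions at the edges and the middle of the x range
new_x <- data.frame(x = c(-4, 0, 4))
ols_pred   <- predict(ols_fit, newdata = new_x)
logit_pred <- predict(logit_fit, newdata = new_x, type = "response")

data.frame(x = new_x$x, ols = ols_pred, logistic = logit_pred)
```

The logistic predictions are probabilities by construction, while the linear fit has no such constraint.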

model <- glm(successful ~ score, data = movies_,
             family = binomial(link = 'logit'))

model$coefficients
## (Intercept)       score 
##  -2.3609088   0.4908926

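For the first model I used score alone to see what it would look like. The score coefficient is positive (0.491), so each one-point increase in score raises the log-odds of success by about 0.491. The intercept (-2.361) is the log-odds of success at a score of 0.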
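
Written out with the rounded coefficients, the fitted one-variable model has the form below, where p stands for the probability that a movie is successful given its score (the symbol p is introduced here for illustration). The right-hand expression is the sigmoid curve used in the next plot.

```latex
\log\frac{p}{1-p} = -2.361 + 0.491\,\text{score}
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(-2.361 + 0.491\,\text{score})}}
```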

# these coefficients come from the model
sigmoid <- \(x) 1 / (1 + exp(-(-2.361 + 0.491 * x)))

movies_ |>
  ggplot(mapping = aes(x = score, y = successful)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()
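
As an alternative to typing the rounded coefficients by hand, the curve can be built from the fitted model object. This is a minimal sketch on simulated data; `toy`, `fit`, and `sigmoid_fit` are hypothetical names, and the true coefficients used to simulate the data are assumptions for the illustration.

```r
library(ggplot2)

set.seed(42)  # simulated data, not the movies data set
toy <- data.frame(score = runif(300, min = 1, max = 10))
toy$successful <- rbinom(300, size = 1, prob = plogis(-2.4 + 0.5 * toy$score))

fit <- glm(successful ~ score, data = toy, family = binomial(link = "logit"))

# Read the coefficients from the model instead of hard-coding them
b <- unname(coef(fit))
sigmoid_fit <- \(x) plogis(b[1] + b[2] * x)   # plogis(z) is 1 / (1 + exp(-z))

p <- ggplot(toy, mapping = aes(x = score, y = successful)) +
  geom_jitter(width = 0, height = 0.1, shape = "O", size = 3) +
  geom_function(fun = sigmoid_fit, color = "blue", linewidth = 1) +
  theme_minimal()

p
```

Reading from `coef()` keeps the curve in sync if the model is refit on different data.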

In the above plot we can see the sigmoid curve built from the fitted coefficients: the predicted probability of success rises as score increases. Now below I will add the other two explanatory variables, budget and runtime, and see what happens to the model.

model <- glm(successful ~ score + budget + runtime, data = movies_,
             family = binomial(link = 'logit'))

model$coefficients
##   (Intercept)         score        budget       runtime 
## -1.844022e+00  6.133175e-01  1.543885e-08 -1.649141e-02

So with the above coefficients we can see that the intercept is -1.844. This is the log-odds of success when all variables are at 0, which corresponds to a probability of about 0.14, meaning such a movie is most likely not successful. Holding the other variables fixed, each one-point increase in score raises the log-odds of success by 0.613, and each additional dollar of budget raises them by 0.0000000154. Each one-unit increase in runtime, though, lowers the log-odds by 0.0165.
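
Log-odds are hard to read directly, so a short sketch of converting them may help. The coefficients below are copied from the output above; the example movie (score 7, budget of 50 million, runtime 120) is hypothetical, and rescaling budget to steps of 10 million is just one convenient choice.

```r
# Coefficients copied from model$coefficients above
coefs <- c(intercept = -1.844022,
           score     = 0.6133175,
           budget    = 1.543885e-08,
           runtime   = -0.01649141)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coefs[c("score", "budget", "runtime")])

# A one-dollar step is tiny, so also look at a 10-million step in budget
or_budget_10m <- exp(coefs[["budget"]] * 1e7)

# Predicted probability for a hypothetical movie
movie <- c(score = 7, budget = 5e7, runtime = 120)
log_odds  <- coefs[["intercept"]] + sum(coefs[names(movie)] * movie)
p_success <- plogis(log_odds)   # convert log-odds to a probability

odds_ratios
or_budget_10m
p_success
```

An odds ratio above 1 means the odds of success go up with that variable, and one below 1 means they go down.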

summary(model)
## 
## Call:
## glm(formula = successful ~ score + budget + runtime, family = binomial(link = "logit"), 
##     data = movies_)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5329  -1.2468   0.6957   0.8952   2.0123  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.844e+00  2.329e-01  -7.917 2.43e-15 ***
## score        6.133e-01  3.676e-02  16.686  < 2e-16 ***
## budget       1.544e-08  1.143e-09  13.507  < 2e-16 ***
## runtime     -1.649e-02  2.040e-03  -8.083 6.32e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6831.7  on 5434  degrees of freedom
## Residual deviance: 6320.1  on 5431  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 6328.1
## 
## Number of Fisher Scoring iterations: 4
confint(model, parm = "score", level = 0.95)
##     2.5 %    97.5 % 
## 0.5418049 0.6859245

We can see that the 95% confidence interval for the score coefficient ends up being (0.542, 0.686). This means we are 95% confident that, holding budget and runtime fixed, a one-point increase in score raises the log-odds of success by between 0.542 and 0.686. Because the whole interval lies above 0, this leads us to think that higher-scored movies have a higher chance of being successful.
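
Two quick follow-ups on this interval can be sketched from the printed numbers alone. The first rebuilds an approximate (Wald) interval from the estimate and standard error in the summary table, which is a different method from the one `confint()` used here, so small differences are expected. The second exponentiates the interval to put it on the odds-ratio scale. The standard error is the rounded value from the summary output.

```r
# Values copied from the summary and confint output above
est        <- 0.6133175              # score coefficient
se         <- 0.03676                # its standard error (rounded in the output)
profile_ci <- c(0.5418049, 0.6859245)

# Approximate Wald interval: estimate +/- z * standard error
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se

# Same interval on the odds-ratio scale
or_ci <- exp(profile_ci)

wald_ci
or_ci
```

On the odds-ratio scale the interval lies entirely above 1, which matches the log-odds interval lying entirely above 0.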