First we bring in all the libraries we will be using. Then we load the data set we have downloaded.
#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)
library(broom)
library(boot)
library(lindia)
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
#remove rows with missing values in the variables used in the model
movies_raw <- movies_raw |>
  drop_na(budget, score, runtime, gross)
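Before dropping rows, it can help to see how many values are missing in each column, so we know how much data the filter costs us. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the movies data) and base R only.

```r
# Hypothetical toy data standing in for the movies table
toy <- data.frame(
  budget  = c(1e6, NA, 5e6, 2e7),
  score   = c(6.5, 7.1, NA, 8.0),
  runtime = c(95, 110, 120, 101),
  gross   = c(3e6, 9e6, 1e7, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only rows that are complete in the model variables
model_vars <- c("budget", "score", "runtime", "gross")
toy_complete <- toy[complete.cases(toy[, model_vars]), ]
nrow(toy_complete)
```

Comparing `nrow(toy)` with `nrow(toy_complete)` shows how many rows the filter removes.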
The next step for our data set is to clean it and format it so that we can begin to work through it.
#create a new table separating the released column into two release date/country
movies_ <- movies_raw |>
separate(released, into = c("release_new","country_released"), sep=" \\(") |>
mutate(country_released = str_remove(country_released, "\\)$")) |> #remove the closing parenthesis
mutate(release_date=mdy(release_new)) |> #then change the date to an easier format
rename(country_filmed=country) #rename column for ease of understanding
movies_
## # A tibble: 5,435 × 17
## name rating genre year release_new country_released score votes director
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 The Sh… R Drama 1980 June 13, 1… United States 8.4 9.27e5 Stanley…
## 2 The Bl… R Adve… 1980 July 2, 19… United States 5.8 6.5 e4 Randal …
## 3 Star W… PG Acti… 1980 June 20, 1… United States 8.7 1.20e6 Irvin K…
## 4 Airpla… PG Come… 1980 July 2, 19… United States 7.7 2.21e5 Jim Abr…
## 5 Caddys… R Come… 1980 July 25, 1… United States 7.3 1.08e5 Harold …
## 6 Friday… R Horr… 1980 May 9, 1980 United States 6.4 1.23e5 Sean S.…
## 7 The Bl… R Acti… 1980 June 20, 1… United States 7.9 1.88e5 John La…
## 8 Raging… R Biog… 1980 December 1… United States 8.2 3.30e5 Martin …
## 9 Superm… PG Acti… 1980 June 19, 1… United States 6.8 1.01e5 Richard…
## 10 The Lo… R Biog… 1980 May 16, 19… United States 7 1 e4 Walter …
## # ℹ 5,425 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## # budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## # release_date <date>
I’m going to build this model using gross (revenue) as my response variable, with budget, score, and runtime as my explanatory variables. I find this one interesting because we can see whether all three have significant effects in the model, and which one seems to have the biggest effect.
model <- lm(gross ~ budget + score + runtime, movies_)
model$coefficients
## (Intercept) budget score runtime
## -2.048041e+08 3.326480e+00 3.574230e+07 -3.718780e+05
summary(model)
##
## Call:
## lm(formula = gross ~ budget + score + runtime, data = movies_)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529487892 -47978542 -10112912 29896911 2045129030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.048e+08 1.258e+07 -16.280 < 2e-16 ***
## budget 3.326e+00 4.199e-02 79.224 < 2e-16 ***
## score 3.574e+07 1.887e+06 18.944 < 2e-16 ***
## runtime -3.719e+05 1.055e+05 -3.524 0.000428 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121600000 on 5431 degrees of freedom
## Multiple R-squared: 0.5779, Adjusted R-squared: 0.5776
## F-statistic: 2478 on 3 and 5431 DF, p-value: < 2.2e-16
We can see above how the large magnitude of the gross values leads to a very large intercept, as well as large coefficients for score and runtime. The intercept is the predicted gross when budget, runtime, and score are all 0, and here it is negative. That obviously isn’t realistic, because runtime and budget will always be above a certain amount. Score can be 0, but the average is around 7, so the intercept is extrapolating far below the observed data, which is part of why it is so low.
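One common way to get a more readable intercept is to center each predictor at its mean. The slopes stay the same, and the intercept becomes the predicted response for a movie with average budget, score, and runtime, which for least squares equals the mean of the response. This is a minimal sketch on made-up numbers (the `toy` data frame is hypothetical), not a refit of the model above.

```r
# Hypothetical toy data, values chosen only for illustration
toy <- data.frame(
  budget  = c(10, 20, 35, 50, 80, 120),
  score   = c(5.5, 6.1, 6.8, 7.0, 7.4, 8.2),
  runtime = c(90, 95, 104, 110, 125, 140),
  gross   = c(12, 40, 70, 95, 210, 400)
)

# Center each predictor at its own mean
toy_centered <- transform(
  toy,
  budget_c  = budget  - mean(budget),
  score_c   = score   - mean(score),
  runtime_c = runtime - mean(runtime)
)

raw_fit      <- lm(gross ~ budget + score + runtime, data = toy)
centered_fit <- lm(gross ~ budget_c + score_c + runtime_c, data = toy_centered)

# The centered intercept is the prediction at average predictor values
coef(centered_fit)["(Intercept)"]
mean(toy$gross)
```

Applied to the movies data, the same idea would replace the negative intercept with the predicted gross for an average movie in the sample.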
#Residuals vs Fitted Values
gg_resfitted(model) +
geom_smooth(se=FALSE)
#Residuals vs X Values
plots <- gg_resX(model, plot.all = FALSE)
plots$runtime +
geom_smooth(se = FALSE)
plots$budget +
geom_smooth(se = FALSE)
plots$score +
geom_smooth(se = FALSE)
Starting off with the Residuals vs Fitted Values chart, we can see that the residuals clearly do not show constant, random scatter, except perhaps near the upper fitted values. The model looks heteroscedastic: the points clump together at the low end and spread out as the fitted values increase. The X value plots look a bit different. Runtime seems to have a fairly consistent error variance, which could mean it is not an issue in the model. Budget and score seem to have bigger issues, though, and might need to be transformed in the future if we are fixing the model.
Let’s look at a residual histogram and a QQ plot next.
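The fanning pattern can also be checked with a rough number instead of only by eye. One simple option, sketched here on simulated data rather than the movies data, is to regress the absolute residuals on the fitted values: a clearly positive slope suggests the spread grows with the fitted value. All names and the simulation settings below are assumptions made for illustration.

```r
set.seed(42)  # simulated data, not the movies data

# Noise whose standard deviation grows with x, so the fit is heteroscedastic
n <- 500
x <- seq(1, 100, length.out = n)
y <- 5 + 3 * x + rnorm(n, sd = 0.5 * x)

toy_fit <- lm(y ~ x)

# Regress the size of the residuals on the fitted values
spread_fit   <- lm(abs(resid(toy_fit)) ~ fitted(toy_fit))
spread_slope <- unname(coef(spread_fit)[2])
spread_slope
```

This is only a quick diagnostic. A formal test or a transformed response would be the next step if we were fixing the model.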
#Residual histogram
gg_reshist(model)
#QQ-Plot
gg_qqplot(model)
Now we can definitely see some discrepancies. The histogram is concentrated around the center but has a long right tail (right skew). This could be because the gross revenue of some movies is so high that it throws off the rest of our model, which could point to removing some outliers to get a better fit. The QQ plot is clearly not normal either, but there it is easier to point out what is wrong. The points follow the line from around -1.7 to 1.5, but then both ends curve away from it. This shows we have more extreme values than a normal distribution would produce, once again pointing to extreme outliers.
Now we are going to look at Cook’s D, because it might show us some of our biggest outliers.
gg_cooksd(model, threshold = 'matlab')
We can immediately see that a large number of points are flagged as outliers, and this makes sense. Some of the movies in the data have grossed more than $1.5 billion, which is NOT typical and would not fit a normal distribution. This means we need to look into handling the outliers in order to correct our model.
I’m going to bring back the model summary from earlier to better explain budget.
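To list the flagged points rather than only plot them, Cook's distance can be computed directly with base R's `cooks.distance()`. The sketch below uses made-up data with one planted extreme point. The cutoff of three times the mean Cook's distance is my understanding of what the `'matlab'` threshold option means, so treat it as an assumption and check the lindia documentation.

```r
# Hypothetical toy data: 30 points on a clear trend plus one planted extreme point
x <- c(1:30, 60)
y <- c(2 * (1:30) + sin(1:30), 0)

toy_fit <- lm(y ~ x)

# Cook's distance for every observation
d <- cooks.distance(toy_fit)

# Assumed cutoff: three times the mean Cook's distance
cutoff  <- 3 * mean(d)
flagged <- which(d > cutoff)
flagged
```

On the movies model, the same approach with `cooks.distance(model)` would give row indices that can be looked up in `movies_` to see which films are driving the fit.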
model$coefficients
## (Intercept) budget score runtime
## -2.048041e+08 3.326480e+00 3.574230e+07 -3.718780e+05
summary(model)
##
## Call:
## lm(formula = gross ~ budget + score + runtime, data = movies_)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529487892 -47978542 -10112912 29896911 2045129030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.048e+08 1.258e+07 -16.280 < 2e-16 ***
## budget 3.326e+00 4.199e-02 79.224 < 2e-16 ***
## score 3.574e+07 1.887e+06 18.944 < 2e-16 ***
## runtime -3.719e+05 1.055e+05 -3.524 0.000428 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121600000 on 5431 degrees of freedom
## Multiple R-squared: 0.5779, Adjusted R-squared: 0.5776
## F-statistic: 2478 on 3 and 5431 DF, p-value: < 2.2e-16
When we look at budget, we see that its coefficient is 3.33. This means that if we increased the budget by $1, holding score and runtime fixed, we would expect the gross revenue to increase by $3.33. This number looks very different from the other coefficients because gross and budget are both monetary values measured in dollars, with very large numbers. Score only runs from 0 to 10 and runtime is somewhere between 0 and 200 minutes, so a change of just 1 point or 1 minute is a big step on those scales and comes with a much larger coefficient.
This is a pretty good indicator for production companies to put more money into their movies. That’s an interesting conclusion, because it is actually what has been happening in Hollywood recently. We have seen a shift and essentially the “death” of mid-budget movies, which often were comedies and rom-coms, and in the last 10 years there has been a noticeable lack of those coming out. Movie studios seem to have already figured this out, as the big studios have focused on 1-2 big $300 million movies a year. On the flip side, smaller studios like Blumhouse and A24 have switched to low-budget movies that either don’t lose much money if they flop or make a ton of money for the studio if they get really popular. This can even be seen at the Oscars, as Anora won Best Picture on a budget of just $6 million, compared to other movies it went up against like Dune: Part Two, which cost $190 million to make.
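To put the coefficients on a more comparable footing, we can multiply each one by a change that is realistic for its variable. The coefficients below are copied from the model output above. The step sizes ($1 million of budget, 1 score point, 10 minutes of runtime) are my own choices for illustration.

```r
# Coefficients copied from the model output above
coefs <- c(budget = 3.326480, score = 3.574230e+07, runtime = -3.718780e+05)

# Expected change in gross for an illustrative step in each variable,
# holding the other two fixed
change_budget_1m     <- unname(coefs["budget"])  * 1e6  # $1 million more budget
change_score_1pt     <- unname(coefs["score"])   * 1    # 1 more score point
change_runtime_10min <- unname(coefs["runtime"]) * 10   # 10 more minutes

round(c(budget_1m     = change_budget_1m,
        score_1pt     = change_score_1pt,
        runtime_10min = change_runtime_10min))
```

Seen this way, $1 million of extra budget goes with about $3.3 million more gross, while one extra score point goes with about $35.7 million more.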