Project - Statistics in Data Science

Author

Justin Segrest

Introduction/Background

Hello, my name is Justin Segrest, and I am currently a student at Indiana University - Indianapolis. This project is for the class Statistics in Data Science and aims to analyze and present a data set that is important to us.

My data set is taken from a Kaggle data set known as "Movie Industry." It was built from the IMDB database of movies and contains 200 movies from each year between 1985 and 2019. I chose this data set because I have a deep interest in and appreciation of movies, and I wanted to see whether I could better understand how the industry has gotten to where it is today. Many people complain about the current state of the movie industry, with original ideas becoming less frequent and franchise and remake movies taking over. Has this led to a decrease in actual ratings, or is this just a cultural perception that is not reflected in the data?

My audience for this presentation is essentially Hollywood. By that I mean prominent production companies and the people who run them. Examples include Warner Bros., Paramount Pictures, Lucasfilm, and many others. This data set could be influential in the movie industry because it can help executives better understand what has worked in the past and what is working now.

Purpose

This leads me to my purpose for this project. I want to learn how the average scores of movies are affected by various variables. Does the budget given to a movie have a direct relationship with the score the movie eventually receives? Does the same apply to other variables, including runtime and gross revenue? By the end of my analysis I want to be able to answer these questions and hopefully provide valuable information that can be expanded upon in the future.

Analysis

Cleaning the data

Before I begin cleaning the data, I must load all of the necessary libraries.

#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)


library(forcats)
library(lubridate)


library(stringr)
library(janitor)


library(ggplot2)
library(scales)


library(pwrss)


library(tidyverse)

library(ggthemes)
library(ggrepel)
library(effsize)
library(broom)
library(boot)
library(lindia)
library(xts)

library(tsibble)

Now I can begin to tidy the data.

#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")

Rows: 7668 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): name, rating, genre, released, director, writer, star, country, com...
dbl (6): year, score, votes, budget, gross, runtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#remove rows with an NA in any of the columns used in the analysis
movies_raw <- movies_raw |>
  drop_na(score, runtime, budget, gross)

#convert gross and budget columns to show as millions
movies_raw$budget_m <- round(movies_raw$budget / 1e6, 2)
movies_raw$gross_m <- round(movies_raw$gross / 1e6 , 2)
  
#copy the cleaned data into a data frame with a shorter name
movies <- movies_raw

There is a lot to unpack from the tidying above. The first thing I do after loading the data is remove the rows with NAs in the columns I am going to work with later: score, runtime, budget, and gross. The NAs in this data set all come from information that is missing on the IMDB website itself; because they are recorded as NA, they are explicit rather than implicit missing values. Removing them leaves 5,435 of the original 7,668 movies and means every later calculation uses complete records. The next step was converting the gross and budget columns into an easier-to-read format, so that a value of 1.00 represents 1 million. This makes the values easier to understand later. Finally, I copy the movies_raw data frame into movies, which gives the rest of the analysis a shorter name to work with.
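Before dropping rows, it can help to count how many values are missing in each column, since that shows how many movies the cleaning step costs. The sketch below is only an illustration: `toy_movies` is a small made-up data frame (not the Kaggle data), and `count_missing()` is a hypothetical helper name. The column names match the ones used above.

```r
# Toy data standing in for movies_raw; the values are invented for illustration.
toy_movies <- data.frame(
  score   = c(7.1, NA, 6.4, 8.0, 5.5),
  runtime = c(120, 95, NA, 140, 101),
  budget  = c(5e7, 1.2e7, 3e7, NA, NA),
  gross   = c(1.5e8, 2e7, 4.5e7, 3e8, 9e6)
)

# Hypothetical helper: number of NA values in each of the chosen columns.
count_missing <- function(df, cols) {
  colSums(is.na(df[, cols, drop = FALSE]))
}

analysis_cols <- c("score", "runtime", "budget", "gross")
missing_counts <- count_missing(toy_movies, analysis_cols)
print(missing_counts)

# Rows that survive when every analysis column must be present.
complete_rows <- toy_movies[complete.cases(toy_movies[, analysis_cols]), ]
print(nrow(complete_rows))
```

With the real data, calling `count_missing(movies_raw, analysis_cols)` right after `read_csv()` and before any rows are dropped would give the per-column counts.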

Chosen Variables

I chose score as my main variable and my main point of analysis because I feel that a movie’s score is indicative of its success at the time of release and also of how it has held up over time. Unlike budget or gross revenue, it is not affected by inflation, which makes it much easier to compare across years.

The other variables I am using to better understand score are runtime, budget, and gross (gross revenue). I chose runtime because it is a common complaint among moviegoers, specifically that a movie is “too long.” It will be interesting to see whether the correlation between the two is negative or positive. I chose budget because it indicates how much a studio trusted the movie to succeed, whether because of the IP, the director, or other factors. I expect movies with a higher budget to be more successful than those with a lower budget. Finally, I chose gross revenue because it is the most important factor for many production companies: how much money will the movie bring in, and how does that relate to the movie’s score? My assumption is that the highest-grossing movies are heavily associated with the highest-scored movies.
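One way to check these expectations before modeling is to compute the correlation between score and each of the other variables. The following is a minimal sketch on invented numbers (`toy_movies` is not the real data set), using base R's `cor()`. The helper name `cor_with_score()` is hypothetical.

```r
# Invented values for illustration only; they are not taken from the data set.
toy_movies <- data.frame(
  score    = c(5.2, 6.1, 6.8, 7.4, 8.1, 6.5),
  runtime  = c(92, 101, 115, 128, 150, 110),
  budget_m = c(60, 15, 40, 25, 90, 8),
  gross_m  = c(80, 30, 120, 95, 400, 12)
)

# Hypothetical helper: Pearson correlation of score with each other column.
cor_with_score <- function(df, predictors) {
  sapply(predictors, function(p) cor(df$score, df[[p]], use = "complete.obs"))
}

correlations <- cor_with_score(toy_movies, c("runtime", "budget_m", "gross_m"))
print(round(correlations, 2))
```

On the real data the call would be `cor_with_score(movies, c("runtime", "budget_m", "gross_m"))`. A correlation only describes association, so a positive or negative value does not by itself show that a variable causes higher or lower scores.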

Starter Models

Scatterplots

#Scatterplot of Score and Runtime
ggplot(movies, aes(x = runtime, y = score)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  labs(title = "Movie Score vs. Runtime",
       x = "Runtime (min)",
       y = "Score") +
  theme_minimal() +
  geom_smooth(method = "lm", color = "darkred", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

#Scatterplot of Score and Budget
ggplot(movies, aes(x = budget_m, y = score)) +
  geom_point(color = "purple", alpha = 0.6) +
  labs(title = "Movie Score vs. Budget",
       x = "Budget (millions)",
       y = "Score") +
  theme_minimal() +
  geom_smooth(method = "lm", color = "darkred", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

#Scatterplot of Score and Gross Revenue
ggplot(movies, aes(x = gross_m, y = score)) +
  geom_point(color = "green", alpha = 0.6) +
  labs(title = "Movie Score vs. Gross Revenue",
       x = "Gross Revenue (millions)",
       y = "Score") +
  theme_minimal() +
  geom_smooth(method = "lm", color = "darkred", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

#Create a model for the variables
movie_model <- lm(score ~ runtime + budget_m + gross_m, data = movies)

#View the model summary
summary(movie_model)


Call:
lm(formula = score ~ runtime + budget_m + gross_m, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1632 -0.4673  0.0678  0.5550  2.5502 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.053e+00  7.091e-02   57.16   <2e-16 ***
runtime      2.238e-02  6.703e-04   33.39   <2e-16 ***
budget_m    -7.227e-03  4.180e-04  -17.29   <2e-16 ***
gross_m      1.734e-03  9.154e-05   18.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8471 on 5431 degrees of freedom
Multiple R-squared:  0.227, Adjusted R-squared:  0.2266 
F-statistic: 531.6 on 3 and 5431 DF,  p-value: < 2.2e-16
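
To read the coefficients above, it can help to plug in numbers by hand. The sketch below copies the four estimates from the summary output and computes the fitted score for a hypothetical movie (120 minutes, a budget of 50 million, a gross of 100 million). The example movie and the helper name `predict_score()` are assumptions for illustration and are not part of the data set.

```r
# Coefficient estimates copied from the summary(movie_model) output above.
coefs <- c(
  intercept = 4.053,
  runtime   = 2.238e-02,
  budget_m  = -7.227e-03,
  gross_m   = 1.734e-03
)

# Hypothetical helper: fitted score from the linear model equation.
predict_score <- function(runtime, budget_m, gross_m) {
  unname(coefs["intercept"] +
    coefs["runtime"] * runtime +
    coefs["budget_m"] * budget_m +
    coefs["gross_m"] * gross_m)
}

# Made-up example movie: 120 minutes, budget of 50 million, gross of 100 million.
fitted_score <- predict_score(runtime = 120, budget_m = 50, gross_m = 100)
print(round(fitted_score, 2))  # 6.55

# Holding budget and gross fixed, 10 extra minutes shift the fitted score by:
print(round(coefs[["runtime"]] * 10, 2))  # 0.22
```

With the fitted object available, `predict(movie_model, newdata = data.frame(runtime = 120, budget_m = 50, gross_m = 100))` should give the same value up to rounding of the printed estimates. The multiple R-squared of 0.227 means the three variables together explain roughly 23% of the variation in score, so any single prediction carries considerable uncertainty.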