Hello, my name is Justin Segrest, and I am currently a student at Indiana University - Indianapolis. This project is for the class Statistics in Data Science and is aimed at analyzing and presenting a data set that is important to us.
My data set is taken from a Kaggle data set known as "Movie Industry." This data set was made using the IMDb database of movies and contains 200 movies from each year between 1985-2019. I was interested in using this data set because I have a deep interest in and appreciation of movies, and I wanted to see if I could better understand how the industry has gotten to where it is today. Many people complain about the current state of the movie industry, with original ideas becoming less frequent and franchise and remake movies taking over. Has this led to a decrease in actual ratings, or is this just a cultural idea that is not reflected in the data?
My audience for this presentation is essentially Hollywood. By that I mean prominent production companies and the people who run them. Some examples include Warner Bros., Paramount Pictures, Lucasfilm, and many others. This data set can have real influence in the movie industry, as it can help executives better understand what has worked in the past and what is working now.
Purpose
This leads me to my purpose for this project. I want to learn how the average scores of movies are affected by various variables. Does the budget given to a movie have a direct relationship with the eventual score the movie receives? Does the same apply to other variables, including runtime and gross revenue? By the end of my analysis I want to be able to present an answer to these questions and hopefully provide valuable information that can be expanded upon in the future.
Analysis
Cleaning the data
Before I begin cleaning the data, I must load all of the necessary libraries.
#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(pwrss)
library(tsibble)
Now I can begin to tidy the data.
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
Rows: 7668 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): name, rating, genre, released, director, writer, star, country, com...
dbl (6): year, score, votes, budget, gross, runtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#remove all na's
movies_raw <- movies_raw |> drop_na(score)
movies_raw <- movies_raw |> drop_na(runtime)
movies_raw <- movies_raw |> drop_na(budget)
movies_raw <- movies_raw |> drop_na(gross)
#convert gross and budget columns to show as millions
movies_raw$budget_m <- round(movies_raw$budget / 1e6, 2)
movies_raw$gross_m <- round(movies_raw$gross / 1e6, 2)
#turn movies_raw into an easier data frame
movies <- movies_raw
There is a lot to unpack in the tidying above. The first thing I do after loading the data is remove rows with NAs in the columns I am going to work with later. The NAs in this data set all come from values that are missing on the IMDb website itself. Removing them keeps the later analysis from being distorted by incomplete records. The next step was converting the gross and budget columns into an easier-to-read format, so that a value of 1.00 in gross_m or budget_m represents 1 million. This makes the values easier to interpret later. Finally, I copy the movies_raw data frame into movies, which gives the rest of the analysis a shorter, clearer name to work with.
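As a quick sanity check on the NA-removal step, it can help to count the missing values per column before dropping any rows. The following is a minimal base R sketch on a small made-up data frame: `movies_toy` and all of its values are hypothetical and not taken from the Kaggle file, and `complete.cases()` is used here as a base R stand-in for the repeated `drop_na()` calls.

```r
# Hypothetical toy data standing in for movies_raw (values are made up)
movies_toy <- data.frame(
  name    = c("A", "B", "C", "D"),
  score   = c(7.1, NA, 6.4, 8.0),
  runtime = c(110, 95, NA, 142),
  budget  = c(2.0e7, 5.5e6, 1.0e8, NA),
  gross   = c(4.56e7, 1.2e7, 3.1e8, 9.0e7)
)

cols <- c("score", "runtime", "budget", "gross")

# Count missing values in each column that the analysis relies on
na_counts <- colSums(is.na(movies_toy[cols]))
print(na_counts)

# Keep only rows that are complete in all four columns
movies_clean <- movies_toy[complete.cases(movies_toy[cols]), ]

# Express budget and gross in millions, rounded to two decimals
movies_clean$budget_m <- round(movies_clean$budget / 1e6, 2)
movies_clean$gross_m  <- round(movies_clean$gross / 1e6, 2)

print(movies_clean)
```

Running the same `colSums(is.na(...))` check on `movies_raw` before the `drop_na()` calls would show which of the four columns accounts for most of the removed rows.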
Chosen Variables
I chose to make score my main variable and my main point of analysis. I chose it because I feel that the score of a movie is indicative of the movie’s success at the time and also of how it has held up over time. It also doesn’t have to deal with issues such as inflation, which makes it much easier to understand when analyzing.
The other variables I am looking at to better understand score are runtime, budget, and gross (gross revenue). I chose runtime because it is a common complaint among many moviegoers, specifically a movie being “too long.” It will be interesting to see if there is a negative or positive correlation between the two. I chose budget because it is indicative of how much a studio trusted the movie to succeed, whether because of the IP, the director, or other factors. I expect movies with a higher budget to be more successful than those with a lower budget. Finally, I chose gross revenue because it is the most important factor for many production companies. How much money is the movie going to make for the production company, and how does this relate to the score of the movie? My assumption would be that the highest grossing movies are heavily associated with the highest scored movies.
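The expectations above are statements about correlation, so one way to check them before fitting anything is a correlation matrix. The sketch below uses base R's `cor()` and `cor.test()` on a hypothetical five-row data frame (`movies_toy` and its values are made up for illustration); on the real data the same calls would presumably be applied to the `score`, `runtime`, `budget_m`, and `gross_m` columns of `movies`.

```r
# Hypothetical toy data (made-up values, not from the Kaggle file)
movies_toy <- data.frame(
  score    = c(5.8, 6.4, 7.1, 7.9, 8.3),
  runtime  = c(92, 101, 118, 135, 150),
  budget_m = c(80, 15, 40, 25, 60),
  gross_m  = c(120, 30, 95, 110, 400)
)

# Pairwise Pearson correlations between score and the three predictors
cor_matrix <- cor(movies_toy)
print(round(cor_matrix["score", ], 2))

# Test whether the score/runtime correlation differs from zero
runtime_test <- cor.test(movies_toy$score, movies_toy$runtime)
print(runtime_test$estimate)
```

A correlation only describes the pairwise linear association; it does not hold the other variables fixed the way a multiple regression does, so the two can disagree in sign.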
Starter Models
Scatterplots
#Scatterplot of Score and Runtime
ggplot(movies, aes(x = score, y = runtime)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  labs(title = "Movie Score vs. Runtime", x = "Score", y = "Runtime (min)") +
  theme_minimal() +
  geom_smooth(method = "lm", color = "darkred", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
#Scatterplot of Score and Budget
ggplot(movies, aes(x = score, y = budget_m)) +
  geom_point(color = "purple", alpha = 0.6) +
  labs(title = "Movie Score vs. Budget", x = "Score", y = "Budget (millions)") +
  theme_minimal() +
  geom_smooth(method = "lm", color = "darkred", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
#Scatterplot of Score and Gross Revenue
ggplot(movies, aes(x = score, y = gross_m)) +
  geom_point(color = "green", alpha = 0.6) +
  labs(title = "Movie Score vs. Gross Revenue", x = "Score", y = "Gross Revenue (millions)") +
  theme_minimal() +
  geom_smooth(method = "lm", color = "darkred", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
#Create a model for the variables
movie_model <- lm(score ~ runtime + budget_m + gross_m, data = movies)
#View the model summary
summary(movie_model)
Call:
lm(formula = score ~ runtime + budget_m + gross_m, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1632 -0.4673  0.0678  0.5550  2.5502

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.053e+00  7.091e-02   57.16   <2e-16 ***
runtime      2.238e-02  6.703e-04   33.39   <2e-16 ***
budget_m    -7.227e-03  4.180e-04  -17.29   <2e-16 ***
gross_m      1.734e-03  9.154e-05   18.94   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8471 on 5431 degrees of freedom
Multiple R-squared: 0.227, Adjusted R-squared: 0.2266
F-statistic: 531.6 on 3 and 5431 DF, p-value: < 2.2e-16
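To make the coefficient table more concrete, the sketch below plugs the rounded estimates from the summary into the fitted equation for one hypothetical movie. The movie's values (120 minutes, 50 million budget, 150 million gross) are made up for illustration, and the result is only approximate because the printed coefficients are rounded. Note also that the R-squared of 0.227 means the model explains roughly 23% of the variation in score, so any single prediction is rough.

```r
# Coefficients copied from the summary(movie_model) output above
coefs <- c(intercept = 4.053, runtime = 2.238e-02,
           budget_m = -7.227e-03, gross_m = 1.734e-03)

# Hypothetical movie: 120 minutes, 50 million budget, 150 million gross
new_movie <- c(runtime = 120, budget_m = 50, gross_m = 150)

# Fitted score = intercept + sum of (coefficient * value)
predicted_score <- unname(coefs["intercept"] +
  sum(coefs[names(new_movie)] * new_movie))
print(round(predicted_score, 2))

# Effect of 10 extra minutes of runtime, holding budget and gross fixed
runtime_effect_10 <- unname(coefs["runtime"]) * 10
print(runtime_effect_10)
```

With the fitted model object available, `predict(movie_model, newdata = data.frame(runtime = 120, budget_m = 50, gross_m = 150))` should give the same kind of estimate using the unrounded coefficients.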