Data Dive Week 7

Setting up R and Loading Data set

First we bring in all the libraries we will be using. Then we load the data set we have downloaded.

#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(pwrss)
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(effsize)

#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")

#remove rows with an NA in any column used later (score, rating, runtime, budget)
movies_raw <- movies_raw |>
  drop_na(score, rating, runtime, budget)
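Before dropping rows it can help to see how many values are actually missing in each column. The sketch below uses a tiny made-up data frame (`toy` and its numbers are hypothetical, not taken from movies.csv) to show the idea.

```r
library(tidyr)

# Hypothetical three-row stand-in for the movies data
toy <- data.frame(
  score  = c(8.4, NA, 7.7),
  budget = c(1.9e7, 4.5e6, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the rows that are complete in the listed columns
toy_clean <- drop_na(toy, score, budget)
nrow(toy_clean)
```

Running the same `colSums(is.na(...))` call on movies_raw would show how many rows each of the four columns costs us.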

The next step for our data set is to clean it and format it so that we can begin to work through it.

#create a new table, splitting the released column into release date and release country
movies_ <- movies_raw |>
  separate(released, into = c("release_new","country_released"), sep=" \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |>    #remove the closing parenthesis
  mutate(release_date=mdy(release_new)) |>         #parse the text date into a Date column
  rename(country_filmed=country)            #rename column for ease of understanding
  
movies_

## # A tibble: 5,475 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 The Sh… R      Drama  1980 June 13, 1… United States      8.4 9.27e5 Stanley…
##  2 The Bl… R      Adve…  1980 July 2, 19… United States      5.8 6.5 e4 Randal …
##  3 Star W… PG     Acti…  1980 June 20, 1… United States      8.7 1.20e6 Irvin K…
##  4 Airpla… PG     Come…  1980 July 2, 19… United States      7.7 2.21e5 Jim Abr…
##  5 Caddys… R      Come…  1980 July 25, 1… United States      7.3 1.08e5 Harold …
##  6 Friday… R      Horr…  1980 May 9, 1980 United States      6.4 1.23e5 Sean S.…
##  7 The Bl… R      Acti…  1980 June 20, 1… United States      7.9 1.88e5 John La…
##  8 Raging… R      Biog…  1980 December 1… United States      8.2 3.30e5 Martin …
##  9 Superm… PG     Acti…  1980 June 19, 1… United States      6.8 1.01e5 Richard…
## 10 The Lo… R      Biog…  1980 May 16, 19… United States      7   1   e4 Walter …
## # ℹ 5,465 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>
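To show what the cleaning step does to a single value, here is a minimal sketch on one made-up row. The `toy` table and its example string are assumptions for illustration; they follow the "Month day, year (Country)" layout that the code above expects.

```r
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)

# One hypothetical value in the same layout as the released column
toy <- tibble(released = "June 13, 1980 (United States)")

toy_split <- toy |>
  # split at " (" so the date and the country land in separate columns
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(
    # drop the trailing ")" left on the country
    country_released = str_remove(country_released, "\\)$"),
    # parse the month-day-year text into a Date
    release_date = mdy(release_new)
  )

toy_split
```

The text date stays in release_new, while release_date holds a real Date that can be sorted or filtered.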

Picking our most “valuable” continuous variable

For my data set I think the score variable is the most valuable to both consumers and producers. When choosing a movie to watch, the score is the first thing you want to know so that you can see if it is even worth your time. Movie producers are very interested in the score because they want their movies to be as good as possible.

Picking a categorical column to analyze with it

I chose rating as the categorical column I wanted to analyze with the score. For example, do R-rated movies tend to score differently than PG-13 movies? The column needs some work first so that we can have the best analysis possible.

#see how many movies are in each category
table(movies_$rating)

## 
##  Approved         G     NC-17 Not Rated        PG     PG-13         R     TV-MA 
##         1       111        12        48       918      1736      2628         2 
##   Unrated         X 
##        18         1

movies_ %>%
  count(rating)

## # A tibble: 10 × 2
##    rating        n
##    <chr>     <int>
##  1 Approved      1
##  2 G           111
##  3 NC-17        12
##  4 Not Rated    48
##  5 PG          918
##  6 PG-13      1736
##  7 R          2628
##  8 TV-MA         2
##  9 Unrated      18
## 10 X             1

#limit the ratings to just G, NC-17, Not Rated, PG, PG-13, R, and Unrated
wanted_ratings <- c("G", "NC-17", "Not Rated", "PG", "PG-13", "R", "Unrated")

movies_ <- movies_ %>%
  filter(rating %in% wanted_ratings)

I decided to keep only the 7 most common ratings and drop Approved, TV-MA, and X, which have just one or two movies each. Groups that small cannot give a reliable average, so removing them keeps the comparison between ratings meaningful.

ANOVA Testing

My null and alternative hypotheses are:

H0 - The average movie score is equal across all rating types

Ha - The average movie score is NOT equal across all rating types (at least one rating differs)

#anova test
m <- aov(score ~ rating, data = movies_)
summary(m)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## rating         6     87  14.554   15.83 <2e-16 ***
## Residuals   5464   5024   0.919                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our p-value is very low (less than 2e-16) and IS significant, so we can reject the null hypothesis. Explaining this to someone not familiar with my data set or ANOVA testing, I would say that the data give strong evidence that the average score is not the same for every rating. The test does not tell us which ratings differ, only that at least one does. Below I have a box plot of the score distribution for each rating (the line inside each box is the median, not the average). We can see that there is a noticeable difference between the rating categories.

movies_ |>
  ggplot() +
  geom_boxplot(mapping = aes(y = score, x = rating)) +
  labs(x = "Movie Rating",
       y = "Movie Score (IMDB)")
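With more than 5,000 movies, a significant p-value does not by itself say how big the differences are. One common way to gauge that is eta squared, the share of the total variation in score that is explained by rating. This is a rough sketch using the sums of squares as printed in the ANOVA table above, so the result is only as precise as that rounding.

```r
# Sums of squares copied from the summary(m) output (rounded as printed)
ss_rating   <- 87
ss_residual <- 5024

# Eta squared = between-group sum of squares / total sum of squares
eta_sq <- ss_rating / (ss_rating + ss_residual)
eta_sq
```

This comes out to about 0.017, which suggests rating accounts for roughly 2% of the variation in score. The difference between ratings looks real, but small.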

Moving on to Linear Regression

First we need to see if there are any continuous variables that seem to show a linear relationship with the score of a movie. I will look first at runtime, then at budget, and choose the one that appears most strongly correlated with score.

#scatterplot with runtime
movies_ |>
  ggplot(mapping = aes(x = runtime, y = score)) +
  geom_point(size = 2, color = 'darkblue')

#scatterplot with budget
movies_ |>
  ggplot(mapping = aes(x = budget, y = score)) +
  geom_point(size = 2, color = 'darkblue')

It is hard to tell just from the scatter plots above which variable has the stronger linear relationship with score, so I am going to add a trend line to each plot to make the comparison easier.

#runtime
movies_ |>
  ggplot(mapping = aes(x = runtime, y = score)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')

#budget
movies_ |>
  ggplot(mapping = aes(x = budget, y = score)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')

We can see that runtime has a stronger positive linear relationship with score than budget does. I will use runtime going forward with my analysis.
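Judging trend lines by eye is a bit subjective, so a correlation coefficient could back up the choice. The sketch below runs on simulated stand-in data (the `toy` data frame and every number in it are made up for illustration); swapping in movies_ would give the real values.

```r
# Simulated stand-in data, NOT the real movies data
set.seed(42)
toy <- data.frame(
  runtime = rnorm(200, mean = 110, sd = 18),
  budget  = rlnorm(200, meanlog = 17, sdlog = 1)
)
# Build a score that depends on runtime plus noise
toy$score <- 4 + 0.02 * toy$runtime + rnorm(200, sd = 0.9)

# Pearson correlation of each candidate predictor with score
r_runtime <- cor(toy$runtime, toy$score)
r_budget  <- cor(toy$budget, toy$score)

c(runtime = r_runtime, budget = r_budget)
```

The variable whose correlation is further from 0 has the stronger linear relationship with score.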

model <- lm(score ~ runtime, movies_)
model$coefficients

## (Intercept)     runtime 
##  3.98050679  0.02226354

The model above shows that when runtime is 0, the predicted score is 3.98. No movie has a runtime of 0, so the intercept is just where the line starts and is not meaningful on its own. For every additional unit (minute) of runtime, the predicted score increases by about 0.02, or roughly 0.22 points per 10 minutes. This is actually really interesting when it is put in the context of the data set. Could it be that longer movies are harder to green light and make, so only the great ones break through? Another explanation could be that a lot of bad movies don't have long runtimes (because of budget, a short script, etc.), so the movies at the higher end of the runtime scale simply have a higher average score. Either way, the model only shows an association and cannot tell us that a longer runtime causes a higher score.
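To make the slope concrete, here is a small sketch that plugs a couple of runtimes into the fitted line, using the coefficients printed above. The helper name `predict_score` is my own, made up for illustration.

```r
# Coefficients copied from model$coefficients above
intercept <- 3.98050679
slope     <- 0.02226354

# Hypothetical helper: predicted score for a runtime given in minutes
predict_score <- function(runtime_min) {
  intercept + slope * runtime_min
}

predict_score(90)
predict_score(120)

# Predicted gain from 30 extra minutes of runtime
predict_score(120) - predict_score(90)
```

With these coefficients, a 120-minute movie is predicted to score about 6.65, roughly 0.67 points higher than a 90-minute movie.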