First we load all the libraries we will be using, then read in the dataset we downloaded.
#Load in Libraries
library(tidyr)
library(readr)
library(dplyr)
library(forcats)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(boot)
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
Next we clean and format the dataset so that we can begin working with it.
#Create a new table separating the released column into release date and country
movies_ <- movies_raw |>
  separate(released, into = c("release_new", "country_released"), sep = " \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |> #remove the closing parenthesis
  mutate(release_date = mdy(release_new)) |> #then convert the date to a proper Date format
  rename(country_filmed = country) #rename column for ease of understanding
movies_
## # A tibble: 7,668 × 17
## name rating genre year release_new country_released score votes director
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 The Sh… R Drama 1980 June 13, 1… United States 8.4 9.27e5 Stanley…
## 2 The Bl… R Adve… 1980 July 2, 19… United States 5.8 6.5 e4 Randal …
## 3 Star W… PG Acti… 1980 June 20, 1… United States 8.7 1.20e6 Irvin K…
## 4 Airpla… PG Come… 1980 July 2, 19… United States 7.7 2.21e5 Jim Abr…
## 5 Caddys… R Come… 1980 July 25, 1… United States 7.3 1.08e5 Harold …
## 6 Friday… R Horr… 1980 May 9, 1980 United States 6.4 1.23e5 Sean S.…
## 7 The Bl… R Acti… 1980 June 20, 1… United States 7.9 1.88e5 John La…
## 8 Raging… R Biog… 1980 December 1… United States 8.2 3.30e5 Martin …
## 9 Superm… PG Acti… 1980 June 19, 1… United States 6.8 1.01e5 Richard…
## 10 The Lo… R Biog… 1980 May 16, 19… United States 7 1 e4 Walter …
## # ℹ 7,658 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## # budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## # release_date <date>
For this week I also want to remove all the NAs in the columns we are going to be working with.
#Get rid of NAs for score and runtime
movies_ <- movies_ |>
  drop_na(runtime, score)
sum(is.na(movies_$runtime))
## [1] 0
sum(is.na(movies_$score))
## [1] 0
#Short, Average, Long Movies
movies_runtime <- movies_ |>
  mutate(runtime_length = case_when(
    runtime < 90 ~ "Short",
    runtime >= 90 & runtime <= 120 ~ "Average",
    runtime > 120 ~ "Long"
  ))
movies_runtime |>
  count(runtime_length) |>
  arrange(desc(n))
## # A tibble: 3 × 2
## runtime_length n
## <chr> <int>
## 1 Average 5353
## 2 Long 1419
## 3 Short 889
#Score Categories (Low, Average, High, Very High)
movies_score <- movies_ |>
  mutate(score_category = case_when(
    score < 5 ~ "Low",
    score >= 5 & score <= 7 ~ "Average",
    score > 7 & score <= 8.5 ~ "High",
    score > 8.5 ~ "Very High"
  ))
movies_score |>
  count(score_category) |>
  arrange(desc(n))
## # A tibble: 4 × 2
## score_category n
## <chr> <int>
## 1 Average 5121
## 2 High 1954
## 3 Low 564
## 4 Very High 22
I chose to work with the runtime and score columns of my dataset. For runtime I separated the values into short (under 90 minutes), average (90 to 120 minutes), and long (over 120 minutes). For score I separated the values into low (under 5), average (5 to 7), high (7 to 8.5), and very high (over 8.5). I chose those score cutoffs because most people consider anything under 5 a “bad” movie, and I wanted a finer distinction than simply treating everything from 7 to 10 as “good.”
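As a quick sanity check (not part of the original workflow), we can confirm the case_when conditions are exhaustive, since any value that falls through every condition becomes NA:

#Sanity check: no movie should fall through the case_when into NA
sum(is.na(movies_runtime$runtime_length))
sum(is.na(movies_score$score_category))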
#Runtime
ggplot(movies_runtime, aes(x = fct_reorder(runtime_length, runtime))) +
  geom_bar(fill = "black") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Distribution of Runtime by Length", x = "Runtime Category", y = "Count") +
  theme_minimal()
#Score
ggplot(movies_score, aes(x = fct_reorder(score_category, score))) +
  geom_bar(fill = "black") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Distribution of Score by IMDB Ratings", x = "Score Category", y = "Count") +
  theme_minimal()
I actually found these results really interesting. Starting with the runtime bar graph, I did not expect almost 70% of the data to fall in the average category. This suggests the cutoffs could be moved to 95 minutes on the low end and 115 on the high end to balance the categories (sketched below). Looking at the score bar graph, I also did not expect so many movies in the average category and so few in the very high category. My explanation for the very high category being so small is that scores tend to drift down over time: people who might not normally like a movie's genre watch it because of its extremely high rating, and then tend to rate it lower than earlier viewers did. So the 22 movies that remain tend to be safe, multi-genre movies.
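To illustrate, here is a rough sketch of that rebalanced binning; the 95- and 115-minute cutoffs are just the values proposed above, not validated thresholds:

#Hypothetical rebalanced cutoffs: 95 and 115 minutes instead of 90 and 120
movies_runtime_alt <- movies_ |>
  mutate(runtime_length = case_when(
    runtime < 95 ~ "Short",
    runtime >= 95 & runtime <= 115 ~ "Average",
    runtime > 115 ~ "Long"
  ))
movies_runtime_alt |>
  count(runtime_length)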
movies_runtime$runtime_rank <- as.numeric(factor(movies_runtime$runtime_length,
                                                 levels = c("Short", "Average", "Long"),
                                                 ordered = TRUE))
cor(movies_runtime$runtime_rank, movies_runtime$runtime, method = "spearman")
## [1] 0.8069864
movies_score$score_rank <- as.numeric(factor(movies_score$score_category,
                                             levels = c("Low", "Average", "High", "Very High"),
                                             ordered = TRUE))
cor(movies_score$score_rank, movies_score$score, method = "spearman")
## [1] 0.8276944
For the correlations above we used the Spearman method, because our new columns are ordered ranks rather than truly numeric values. The results are exactly what we would expect for variables built this way: as the runtime rank goes up, runtime should increase, because we coded the categories to behave that way, and the same holds for score. The only variation comes from where the cutoffs between ranks fall.
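Spearman correlation is simply the Pearson correlation computed on ranks, which gives a quick way to double-check the numbers above:

#Spearman is Pearson on ranks, so these two calls should agree
cor(rank(movies_runtime$runtime_rank), rank(movies_runtime$runtime))
cor(movies_runtime$runtime_rank, movies_runtime$runtime, method = "spearman")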
#First we create the bootstrap function
bootstrap <- function(x, func = mean, n_iter = 10^4) {
  # preallocate a vector to hold the statistic from each iteration
  func_values <- numeric(n_iter)
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull a sample of the same size, with replacement
    x_sample <- sample(x, size = length(x), replace = TRUE)
    # store this iteration's value
    func_values[i] <- func(x_sample)
  }
  return(func_values)
}
runtime_means <- bootstrap(movies_$runtime)
score_means <- bootstrap(movies_$score)
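Since the boot package is already loaded, the same resampling could also be done with boot::boot as a cross-check; this is an equivalent sketch, not the method used above:

#Equivalent resampling with boot::boot; the statistic takes (data, indices)
boot_mean <- function(data, indices) mean(data[indices])
runtime_boot <- boot(movies_$runtime, statistic = boot_mean, R = 10^4)
sd(runtime_boot$t[, 1]) #bootstrap standard error, comparable to sd(runtime_means)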
The bootstrap function gives us the runtime_means and score_means vectors, which we will need for the confidence intervals below.
#Bootstrap standard errors
runtime_se_boot <- sd(runtime_means)
score_se_boot <- sd(score_means)
#Analytic standard errors for comparison
runtime_se_est <- sd(movies_$runtime) / sqrt(length(movies_$runtime))
score_se_est <- sd(movies_$score) / sqrt(length(movies_$score))
#Sample means
avg_runtime <- mean(movies_$runtime)
avg_score <- mean(movies_$score)
We then compute the standard errors (bootstrap and analytic) and the sample means for both columns.
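As a quick check, the bootstrap and analytic standard errors should be nearly identical for a sample this large:

#Compare bootstrap SE to analytic SE; each pair should be very close
c(runtime_se_boot, runtime_se_est)
c(score_se_boot, score_se_est)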
#runtime confidence intervals
P <- .95 # % confidence
n <- length(movies_$runtime)
# t-statistic with n-1 degrees of freedom
t_star <- qt(p = (1 - P)/2, df=n - 1, lower.tail=FALSE)
# t half-width
CI_t <- t_star * runtime_se_boot
# z-score
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
# z half-width
CI_z <- z_score * runtime_se_boot
c(avg_runtime - CI_t, avg_runtime + CI_t) # t-distribution
## [1] 106.8429 107.6779
c(avg_runtime - CI_z, avg_runtime + CI_z) # normal distribution
## [1] 106.8430 107.6779
#score confidence intervals
P <- .95 # % confidence
n <- length(movies_$score)
# t-statistic with n-1 degrees of freedom
t_star <- qt(p = (1 - P)/2, df=n - 1, lower.tail=FALSE)
# t half-width
CI_t <- t_star * score_se_boot
# z-score
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
# z half-width
CI_z <- z_score * score_se_boot
c(avg_score - CI_t, avg_score + CI_t) # t-distribution
## [1] 6.369255 6.412392
c(avg_score - CI_z, avg_score + CI_z) # normal distribution
## [1] 6.369258 6.412389
For runtime our confidence interval comes out to (106.84, 107.68). This means we are 95% confident that the true average runtime of the population our sample was drawn from falls between these numbers. That is consistent with the rest of our analysis, and such a narrow interval is expected given that the dataset has over 7,000 rows.
For score our confidence interval comes out to (6.37, 6.41). This means we are 95% confident that the population's average score is between 6.37 and 6.41. As with runtime, this is a plausible and very narrow confidence interval.
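Since we already have the full bootstrap distributions, a percentile interval makes a simple cross-check on the t and z intervals above (a sketch; the exact endpoints will vary slightly from run to run):

#Percentile bootstrap CIs, which should roughly match the intervals above
quantile(runtime_means, probs = c(0.025, 0.975))
quantile(score_means, probs = c(0.025, 0.975))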