Final Project Data 110

Introduction

The data set I will be using is a collection of all Disney Movies up to 2016 showing their release date, genre, mpaa rating, total gross, and a adjusted gross based on inflation. I obtained it on Kaggle from user Prateek Majumder last updated 5 years ago. I will be focusing on the movie’s genre and mpaa rating to see if it correlates with the movies adjusted gross to see if curtain ratings or genres in Disney movies make different amounts of money. I personally think that Disney over time has lost a lot of quality and I hope seeing this can help to realize what Disney did right or wrong.

There are many movie genres and Disney had only made one or two movies of some genres, and hundreds of movies from other genres. So I combined curtain movie genres with others to simplify things. I put black comedy with comedy, Western to Adventure, and horror to thriller/suspense. I also removed all concerts as there were only two observations and it doesn’t combine well with other genres. The inflation adjusted gross was also very scewed to the left, so I logged the variable to centralize the data. I also extracted the movies month, decade and season, of creation to test for correlation. I used a GLM fit to try to predict a movie’s logged gross by its genre, mpaa rating, season release, and decade release. Due to it being a GlM fit, I had no R square to see how well my line fits into my data so instead I compared the residuals and estimate of the line to the original data.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr   1.1.4     ✔ readr   2.1.5
## ✔ forcats 1.0.0     ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1     ✔ tibble  3.2.1
## ✔ purrr   1.0.4     ✔ tidyr   1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
setwd("C:/Users/Jacob/Downloads/archive (3)")
disney <- read.csv("Disney_Movies.csv")

Cleaning Data

disney |> #seeing what cleaning should be done
  ggplot(aes(x = mpaa_rating, fill = genre)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  theme(
    axis.title.y = element_blank(),
    panel.grid = element_blank()
  )

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

Clearly needs a lot of cleaning.

disney_clean <- disney |>
  select(mpaa_rating, inflation_adjusted_gross, genre, release_date, movie_title)|>#Taking only two categorical identifiers:mpaa rating and genre + numeric columns: gross and release date
  filter(!mpaa_rating %in% c("", "Not Rated")) |> 
  filter(!genre %in% c("","Concert/Performance")) |> #removing NA's and concert has very little counts and doesn't fit into other genres
  mutate( #change genres with low counts into genres that are more or less the same
    genre = case_when(
      genre == "Western" ~ "Adventure",
      genre == "Horror" ~ "Thriller/Suspense",
      genre == "Black Comedy" ~ "Comedy",
      TRUE ~ genre  
    )
  ) |>
  filter(year(release_date) >= 1990) #only looking at movies from 1990's to now
  
   
disney_clean |> #seeing how clean it looks now
  ggplot(aes(x = mpaa_rating, fill = genre)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  theme(
    axis.title.y = element_blank(),
    panel.grid = element_blank()
  )

Much better. Now to see skewness of numerical variables:

disney_clean |>
  ggplot(aes(x = inflation_adjusted_gross, fill = mpaa_rating)) +
  geom_histogram(bins = 200) +
    theme(
    axis.title.y = element_blank(),
    panel.grid = element_blank()
  )

The total gross is extremely scewed left with some out liars. When doing the generalized linear model test I’ll make sure to scale the data but for now I’ll log scale it.

disney_clean <- disney_clean |>
  mutate(log_gross = log(inflation_adjusted_gross)) |> #new column for the logged gross
    mutate(
    month = month(release_date),    #extracting month and year          
    year = year(release_date),                    
    season = case_when(
      month %in% c(12, 1, 2) ~ "Winter", #Finding season movie was released
      month %in% c(3, 4, 5)  ~ "Spring",
      month %in% c(6, 7, 8)  ~ "Summer",
      month %in% c(9, 10, 11) ~ "Fall"
    ),
    decade = paste(floor(year / 10) * 10, "s", sep = "")  # finding the decade the movie was released
  )


disney_clean |> # seeing how skewed the data is now
  ggplot(aes(x = log_gross, fill = mpaa_rating)) + 
  geom_histogram(bins = 200) +
  theme(
    axis.title.y = element_blank(),
    panel.grid = element_blank()
  )

There are still some out liars but the skewness is much better.

Statistical Test

fit <- glm(log_gross ~  mpaa_rating + genre + decade + season, data = disney_clean)
summary(fit)

## 
## Call:
## glm(formula = log_gross ~ mpaa_rating + genre + decade + season, 
##     data = disney_clean)
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            19.18460    0.32981  58.168  < 2e-16 ***
## mpaa_ratingPG          -0.57251    0.20065  -2.853 0.004523 ** 
## mpaa_ratingPG-13       -0.82163    0.22330  -3.679 0.000261 ***
## mpaa_ratingR           -1.47409    0.24890  -5.922 6.22e-09 ***
## genreAdventure         -0.66237    0.27589  -2.401 0.016753 *  
## genreComedy            -1.03591    0.25947  -3.992 7.61e-05 ***
## genreDocumentary       -3.82072    0.44925  -8.505 2.59e-16 ***
## genreDrama             -1.21345    0.26689  -4.547 6.98e-06 ***
## genreMusical           -1.05452    0.50055  -2.107 0.035685 *  
## genreRomantic Comedy   -1.00942    0.36877  -2.737 0.006437 ** 
## genreThriller/Suspense -0.67986    0.34216  -1.987 0.047521 *  
## decade2000s             0.07204    0.14500   0.497 0.619552    
## decade2010s             0.27823    0.18824   1.478 0.140076    
## seasonSpring            0.03777    0.17226   0.219 0.826544    
## seasonSummer            0.35227    0.17689   1.991 0.047026 *  
## seasonWinter            0.04611    0.17818   0.259 0.795914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.797228)
## 
##     Null deviance: 1082.84  on 474  degrees of freedom
## Residual deviance:  824.93  on 459  degrees of freedom
## AIC: 1644.2
## 
## Number of Fisher Scoring iterations: 2

Graphing GLM fit

ggplot(disney_clean, aes(x = log_gross, y = abs(fit$residuals))) +
  geom_bin2d(bins = 40) +
  geom_point(alpha = 0.1,size = 2) +
  scale_fill_distiller(palette = "Blues") +
  labs(
    title = "Residuals vs Original Data",
    caption = "Data Taken from Movies made by Disney",
    x = "Original Logged Gross",
    y = "Residual of Data Point"
  ) +
  theme(
    panel.grid = element_line(linetype = "dashed", color = "black"),
    panel.background = element_rect(fill = "grey"),
    panel.border = element_rect(color = "black", fill = NA, size = 1.5)
  ) +
  scale_x_continuous(breaks = 8:21)

## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(disney_clean, aes(x = log_gross, y = fit$fitted.values, fill = genre)) +
  geom_hex(alpha = 0.7, size = 3) +
  facet_grid(~decade)+
    labs(
    size = 0.7, 
    title = "Estimated Value VS Orignal Value",
    caption = "Data Taken from Movies made by Disney",
    x = "Original Logged Gross",
    y = "Estimated Logged Gross"
  ) +
  theme(
    panel.grid = element_line(color = "darkgrey"),
    panel.background = element_rect(fill = "lightgrey"),
    panel.border = element_rect(color = "black", fill = NA, size = 1.5),
    plot.caption = element_text(hjust = 0.5),
    legend.position = "bottom",
    strip.text = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1") +
  scale_x_continuous(breaks = 8:21) +
  geom_text(data = disney_clean, aes(x = 9.5, y = 17.5, label = decade), fontface = "bold", size = 5)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(disney_clean, aes(x = log_gross, y = fit$fitted.values, fill = mpaa_rating)) +
  geom_hex(alpha = 0.7, size = 3) +
  facet_grid(~decade)+
    labs(
    size = 0.7, 
    title = "Estimated Value VS Orignal Value",
    caption = "Data Taken from Movies made by Disney",
    x = "Original Logged Gross",
    y = "Estimated Logged Gross"
  ) +
  theme(
    panel.grid = element_line(color = "darkgrey"),
    panel.background = element_rect(fill = "lightgrey"),
    panel.border = element_rect(color = "black", fill = NA, size = 1.5),
    plot.caption = element_text(hjust = 0.5),
    legend.position = "bottom",
    legend.key = element_blank(),
    strip.text = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1") +
  scale_x_continuous(breaks = 8:21) +
  geom_text(data = disney_clean, aes(x = 9.5, y = 17.5, label = decade), fontface = "bold", size = 5)

My glm fit gives significant p values for a movie’s rating and the movie’s genre, but a less significant correlation to the movie’s decade and season. Rated R movies and Documentary movies are correlated with the lowest estimates, meaning Rated R and Documentaries are not good options for Disney to make money. Action movies and Rated G are correlated with the highest estimates, so Rated G and Action are good options for Disney to make money. Adventure movies tend to make a fair amount of money too, but not as much as Action movies. As movies get a higher maturity rating, they make less money.

For movies that made a logged gross of about 16 to 20 the glm line has a low residual which means the prediction line is only off by about 1 or 2 units or none at all. That area between movies that made a logged gross of 16 to 20 also has a large amount of movies, so most of Disney movies made about this much money. For movies that made less money the prediction line is bad at predicting it’s logged gross, however there are not many movies that made under 15. The fit line also over estimates many movies thinking the movie would make more money than it actually does. All in all I believe my GLM fitted line is a good prediction of a movie’s logged gross with slight over estimation.

Final Project Data 110

Jacob Dawson

2025-04-28