Is There a Correlation Between Studio Ghibli Productions’ Budget and Revenue?

Introduction

Studio Ghibli is a Japanese animation film studio renowned for the iconic and stunning art in its animated films. It was founded in 1985 by animated film directors Isao Takahata and Hayao Miyazaki. Together, they have produced twenty-three films ranked number one at the box office in Japan in the year in which they were released. Spirited Away is one of the highest-grossing films in Japanese box-office history, collecting a total revenue of over ¥30 billion, or over $190 million. The studio is consistently praised for its distinctive animation, simplistic art style, rich storytelling, and captivating characters and emotions (Ghibli Collection & ChatGPT).

For this project, I will be using nearly all of the variables from my Studio Ghibli dataset: the film Name, the Year it was produced, the Director of the film, the person who wrote the Screenplay, the Budget, the Revenue, its Genre, and the Duration. My plan is to explore the six most cost-efficient Studio Ghibli films, meaning the films that combined a low budget with a high revenue.

Dataset Source: @shruthiiiee on Kaggle

Data Analysis

First things first, I’m going to call in my first library. I will be using more, but I’ll call them in later. Then I set my working directory to the folder where all of the necessities for my project are stored. Finally, I read in my dataset.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/aashkanavale/Desktop/Montgomery College/MC Spring '24/DATA101/projects/Final Project - Studio Ghibli")
ghibli <- read_csv("Studio Ghibli.csv")

## Rows: 23 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Name, Director, Screenplay, Budget, Revenue, Genre 1, Genre 2, Genr...
## dbl (1): Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ghibli

## # A tibble: 23 × 10
##    Name    Year Director Screenplay Budget Revenue `Genre 1` `Genre 2` `Genre 3`
##    <chr>  <dbl> <chr>    <chr>      <chr>  <chr>   <chr>     <chr>     <chr>    
##  1 "When…  2014 Hiromas… Joan G. R… $1150… $34949… Animation Drama     <NA>     
##  2 "The …  2013 Isao Ta… Riko Saka… $4930… $24366… Animation Drama     Fantasy  
##  3 "The …  2013 Hayao M… Tatsuo Ho… $3000… $11793… Drama     Animation Romance  
##  4 "From…  2011 Goro Mi… Hayao Miy… $2200… $61037… Animation Drama     <NA>     
##  5 "The …  2010 Hiromas… Mary Nort… $2300… $14948… Fantasy   Animation Family   
##  6 "Pony…  2008 Hayao M… <NA>       $3400… $20240… Animation Fantasy   Family   
##  7 "Ocea…  1994 Tomomi … Saeko Him… $5000… $10000… Animation Drama     Romance  
##  8 "Tale…  2006 Goro Mi… Ursula K.… $2200… $68625… Animation Fantasy   Adventure
##  9 "Only…  1991 Isao Ta… <NA>       $2500… $47311… Animation Drama     Romance  
## 10 "Spir…  2001 Hayao M… <NA>       $1900… $27492… Animation Family    Fantasy  
## # ℹ 13 more rows
## # ℹ 1 more variable: Duration <chr>

There is quite a bit to clean up. The name of each movie also includes the year, even though there is already a column for the year, so we have to fix that. Some screenplay names are missing, but a quick Google search can provide them. The budgets and revenues have a dollar sign in front of them, meaning they are stored as character strings rather than numbers, so they can’t be used in calculations yet. Many of the movies’ first genre is listed as “Animation”, which tells us nothing new, so we have to fix that too.

Let’s check the classes of my numerical variables first.

class(ghibli$Year)

## [1] "numeric"

class(ghibli$Budget)

## [1] "character"

class(ghibli$Revenue)

## [1] "character"

Looks like year is numeric, but the budget and revenue aren’t. We’ll get to fixing that, but first, let’s recode the names of the movies so that the titles no longer include the year.

ghibli2 <- ghibli

ghibli2$Name <- c("When Marnie Was There", "The Tale of The Princess Kaguya", "The Wind Rises", "From Up on Poppy Hill",
                  "The Secret World of Arrietty", "Ponyo", "Ocean Waves", "Tales from Earthsea", "Only Yesterday",
                  "Spirited Away", "My Neighbors the Yamadas", "Whisper of the Heart", "Grave of the Fireflies",
                  "My Neighbor Totoro", "Princess Mononoke", "Howl's Moving Castle", "Castle in the Sky", "Kiki's Delivery Service",
                  "Pom Poko", "Porco Rosso", "The Cat Returns", "Nausicaä of the Valley of the Wind", "The Boy and the Heron")
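Retyping all 23 titles works, but it is easy to introduce a typo that way. As an alternative, here is a minimal sketch that strips the year with a regular expression. It assumes the raw titles look like "Title (YYYY)"; the truncated preview above doesn’t show the exact format, so the `raw_names` values and the `strip_year()` helper are made up for illustration.

```r
# Hypothetical raw titles; the real column's exact format is an assumption.
raw_names <- c("Spirited Away (2001)", "Ponyo (2008) ", "Ocean Waves")

# Remove a trailing "(YYYY)" and any whitespace around it.
strip_year <- function(x) {
  trimws(sub("\\s*\\(\\d{4}\\)\\s*$", "", x))
}

clean_names <- strip_year(raw_names)
print(clean_names)
```

Titles without a year pass through unchanged, so the helper is safe to apply to the whole column.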

Now I’m going to recode the screenplay names by repeating the same process as above.

ghibli2$Screenplay <- c("Joan G. Robinson", "Riko Sakaguchi", "Tatsuo Hori", "Hayao Miyazaki", "Mary Norton", "Melissa Mathison", 
                       "Saeko Himuro", "Ursula K. Le Guin", "Isao Takahata", "Hayao Miyazaki", "Isao Takahata", "Hayao Miyazaki",
                       "Akiyuki Nosaka", "Hayao Miyazaki", "Hayao Miyazaki", "Diana Wynne Jones", "John Semper", "Eiko Kadono",
                       "Isao Takahata", "Hayao Miyazaki", "Reiko Yoshida", "Kazunori Ito", "Hayao Miyazaki")

For this section, I’m removing the dollar signs from the budget and revenue variables by using the gsub() function…

ghibli2$Budget <- gsub("\\$", "", ghibli2$Budget)
ghibli2$Revenue <- gsub("\\$", "", ghibli2$Revenue)

# Source: ChatGPT

… so that now, I can convert the variables into numeric.

ghibli2$Budget <- as.numeric(ghibli2$Budget)
ghibli2$Revenue <- as.numeric(ghibli2$Revenue)

class(ghibli2$Budget)

## [1] "numeric"

class(ghibli2$Revenue)

## [1] "numeric"
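The two cleaning steps above (removing the symbol, then converting) can also be wrapped into one small helper. This is only a sketch: the `raw_money` strings are made-up examples rather than rows from the dataset, and `to_number()` is a hypothetical name. It also removes commas, in case a value uses thousands separators.

```r
# Hypothetical currency strings (not rows from the dataset).
raw_money <- c("$1,150,000", "$19000000", "$2500")

# Strip dollar signs and thousands separators, then convert to numeric.
to_number <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

money <- to_number(raw_money)
print(money)

# TRUE here would mean at least one value failed to parse.
print(anyNA(money))
```

Checking `anyNA()` right after the conversion is a cheap way to catch values that as.numeric() silently turned into NA. The readr package loaded with the tidyverse also offers parse_number() for the same job.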

Upon further inspection, I noticed that the budget and revenue variables were in Japanese yen and not US dollars, so I multiplied both variables by 0.0064, the approximate yen-to-dollar conversion rate.

ghibli2$Budget <- ghibli2$Budget * 0.0064
ghibli2$Revenue <- ghibli2$Revenue * 0.0064

Since I don’t want any decimal places, I’m just going to round the revenue to the nearest dollar using the round() function.

ghibli2$Revenue <- round(ghibli2$Revenue)
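To sanity-check the conversion and rounding steps, here is a short sketch that applies the same 0.0064 rate to the ¥30 billion figure quoted in the introduction. The `yen_to_usd()` helper is a hypothetical name used only for this example.

```r
# Conversion rate used in this project (yen to US dollars).
yen_to_usd_rate <- 0.0064

# Convert yen to dollars and round to the nearest dollar.
yen_to_usd <- function(yen, rate = yen_to_usd_rate) {
  round(yen * rate)
}

usd <- yen_to_usd(30e9)
print(format(usd, big.mark = ","))
```

This gives $192 million, which agrees with the "over $190 million" quoted for Spirited Away in the introduction.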

Here I’m fixing the genres. I looked up the genre for each production where it states “Animation” so that there is a genre for each movie and repeated the same process I used above.

ghibli2$`Genre 1` <- c("Drama", "Fantasy", "Drama", "Romance", 
                       "Fantasy", "Fantasy", "Drama", "Fantasy", 
                       "Drama", "Fantasy", "Comedy", "Romance", 
                       "Drama", "Fantasy", "Adventure", "Fantasy",
                       "Adventure", "Fantasy", "Adventure", "Family",
                       "Adventure", "Adventure", "Fantasy")
ghibli2

## # A tibble: 23 × 10
##    Name    Year Director Screenplay Budget Revenue `Genre 1` `Genre 2` `Genre 3`
##    <chr>  <dbl> <chr>    <chr>       <dbl>   <dbl> <chr>     <chr>     <chr>    
##  1 When …  2014 Hiromas… Joan G. R… 7.36e6  223677 Drama     Drama     <NA>     
##  2 The T…  2013 Isao Ta… Riko Saka… 3.16e5  155947 Fantasy   Drama     Fantasy  
##  3 The W…  2013 Hayao M… Tatsuo Ho… 1.92e5  754767 Drama     Animation Romance  
##  4 From …  2011 Goro Mi… Hayao Miy… 1.41e5  390642 Romance   Drama     <NA>     
##  5 The S…  2010 Hiromas… Mary Nort… 1.47e5  956675 Fantasy   Animation Family   
##  6 Ponyo   2008 Hayao M… Melissa M… 2.18e5 1295386 Fantasy   Fantasy   Family   
##  7 Ocean…  1994 Tomomi … Saeko Him… 3.20e4     640 Drama     Drama     Romance  
##  8 Tales…  2006 Goro Mi… Ursula K.… 1.41e5  439201 Fantasy   Fantasy   Adventure
##  9 Only …  1991 Isao Ta… Isao Taka… 1.6 e4    3028 Drama     Drama     Romance  
## 10 Spiri…  2001 Hayao M… Hayao Miy… 1.22e5 1759521 Fantasy   Family    Fantasy  
## # ℹ 13 more rows
## # ℹ 1 more variable: Duration <chr>

We’re done cleaning up!

Statistical Analysis: Linear Regression

Now that the dataset is ready for me to use, before I plot my final visualization, I wanted to model a linear regression first. I’m starting by using the cleaned up dataset, ghibli2, and using the ggplot(aes()) function to set my x-axis as the budget and my y-axis as the revenue. Then I use the geom_point() function to actually plot the points. Finally, I use geom_smooth() to add in my regression line.

regression <- ghibli2 |>
  ggplot(aes(x = Budget, y = Revenue)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y~x, se = FALSE) 
regression

It seems like we have 2 massive outliers, Pom Poko (top left) and When Marnie Was There (bottom right), that are preventing us from properly viewing the rest of the points. I’m going to filter for a budget less than 2,000,000 and a revenue less than 10,000,000.

ghibli3 <- ghibli2 |>
  filter(Budget < 2000000) |>
  filter(Revenue < 10000000)
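Before dropping rows, it can help to see exactly which ones a filter removes. The sketch below does this in base R on made-up data (the `films` table and its values are purely illustrative), using the same two thresholds as above.

```r
# Made-up values for illustration only.
films <- data.frame(
  Name    = c("A", "B", "C", "D"),
  Budget  = c(120000, 7360000, 150000, 90000),
  Revenue = c(900000, 220000, 12000000, 400000),
  stringsAsFactors = FALSE
)

# A row is kept only if it passes both conditions.
keep <- films$Budget < 2000000 & films$Revenue < 10000000

kept    <- films[keep, ]
dropped <- films[!keep, ]

# Names of the rows treated as outliers.
print(dropped$Name)
```

Keeping the dropped rows in their own object makes it easy to name the excluded films in the write-up.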

To make the visualization more appealing, I’m going to add an image behind the graph. Here I’m calling in library(magick) and library(raster) because these are the two packages that will help me achieve what I’m planning to do. I use the image_read() function to read in the image from my project folder. Then I use the image_fill() and as.raster() functions to convert the image into a form that ggplot can draw.

I found this information on the RDocumentation website.

library(magick)

## Linking to ImageMagick 6.9.12.93
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11

library(raster)

## Loading required package: sp

## 
## Attaching package: 'raster'

## The following object is masked from 'package:dplyr':
## 
##     select

image <- image_read("/Users/aashkanavale/Desktop/Montgomery College/MC Spring '24/DATA101/projects/Final Project - Studio Ghibli/transparent.png")
image <- image_fill(image, 'none')
image <- as.raster(image)

# Source: RDocumentation

I’m also going to change the font to one of my favorite fonts: Spectral. I bring in library(showtext) that will help me access Google fonts.

I found this information from Daniel Oehm.

library(showtext)

## Loading required package: sysfonts

## Loading required package: showtextdb

font_add_google(name = "Spectral", family = "serif-face") 
showtext_auto()

# Source: Daniel Oehm, Gradient Descending

To actually plot this now, I’m calling my dataset without the outliers and then setting the same axes. I use the annotation_raster() function to bring in my background image. I use geom_point() and set the color to a complementary color. I use the same geom_smooth() function as before, but this time I adjust the linetype and the color.

Then I adjust the labels, adding a title and the units for my axes so the plot is clearer. I also add a caption to credit the movie the background image is from and the source of the data. I switch to a non-default theme and use the theme(text = element_text()) function to bring in the font I called earlier.

regression2 <- ghibli3 |>
  ggplot(aes(x = Budget, y = Revenue)) +
  annotation_raster(image, -Inf, Inf, -Inf, Inf, interpolate = FALSE) + # Source: RDocumentation
  geom_point(color = "#1a3e6d") +
  geom_smooth(method = 'lm', formula = y~x, se = FALSE, linetype = "twodash", color = "forestgreen") +
  labs(title = "Studio Ghibli Budget vs. Revenue Relationship",
       x = "Budget (USD)",   # x is mapped to Budget, so label it accordingly
       y = "Revenue (USD)",  # values are in dollars after conversion, not millions
       caption = "Howl's Moving Castle\nData Source: @shruthiiiee on Kaggle") +
  theme_classic() +
  theme(text = element_text(family = "serif-face"))
regression2

To actually see the correlation, I’m using the cor() function and then using the lm() function to get all of the necessary data I need to make a conclusion.

cor(ghibli3$Revenue, ghibli3$Budget)

## [1] 0.457354

fit <- lm(Budget ~ Revenue, data = ghibli3)
summary(fit)

## 
## Call:
## lm(formula = Budget ~ Revenue, data = ghibli3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149107  -51333  -34726   21256  449398 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 6.665e+04  4.016e+04   1.660   0.1134  
## Revenue     1.160e-01  5.173e-02   2.242   0.0371 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129000 on 19 degrees of freedom
## Multiple R-squared:  0.2092, Adjusted R-squared:  0.1676 
## F-statistic: 5.025 on 1 and 19 DF,  p-value: 0.03711

The cor() function computes the correlation coefficient. This value is always between -1 and 1, and it tells us how strong or weak the linear relationship is. Values close to 1 or -1 indicate a strong correlation (the sign matches the sign of the linear slope, which is positive in this case), values around 0.5 or -0.5 indicate a weak to moderate correlation, and values close to zero indicate little or no linear correlation.

Because my value is 0.457354, which is positive and just under 0.5, there is a weak positive correlation, but it’s still there.
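To show what cor() is doing, here is a small sketch that computes the Pearson correlation by hand and compares it with the built-in function. The `x` and `y` vectors are made-up numbers, not values from the Ghibli dataset.

```r
# Made-up data for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson r: the summed products of deviations from the means,
# divided by the square root of the product of the summed squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)

print(r_manual)
print(r_builtin)
```

Both approaches give the same number, and the result can never fall outside the range -1 to 1.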

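A simple linear regression takes the form y = mx + b, where m is the slope and b is the intercept.
The equation for my model is: Budget = 0.11597(Revenue) + 66651.64252

How do we interpret the equation? For every additional dollar of revenue, the model predicts that the budget is higher by about $0.116. Note that this model uses Budget as the response and Revenue as the predictor, which is the reverse of the axes in the plot above.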
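As a worked example of reading the fitted line, the sketch below plugs revenue values into the equation using the intercept and slope reported by summary(fit). The `predict_budget()` helper is a hypothetical name, and the revenue inputs are arbitrary round numbers chosen for illustration.

```r
# Coefficients as reported in the model summary above.
intercept <- 66651.64252
slope     <- 0.11597

# Predicted budget for a given revenue: y = mx + b.
predict_budget <- function(revenue) {
  intercept + slope * revenue
}

at_one_million <- predict_budget(1e6)
at_two_million <- predict_budget(2e6)

print(at_one_million)
# The difference between the two predictions is the slope times 1,000,000.
print(at_two_million - at_one_million)
```

With the fitted model object available, predict(fit, newdata = data.frame(Revenue = 1e6)) would give the same kind of prediction without copying the coefficients by hand.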

To check whether the result is statistically significant, we look at the p-value. The usual significance level is 0.05, or 5%. My p-value is 0.03711, which is below 0.05, so the positive relationship between revenue and budget is statistically significant. However, the R-squared of 0.2092 means the model explains only about 21% of the variation, so the relationship is still a weak one.
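The numbers reported above can be cross-checked against each other. In a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient, so the short sketch below squares the cor() result and compares the p-value with the 5% threshold. All values are copied from the output above; the variable names are just for illustration.

```r
# Values copied from the cor() and summary(fit) output above.
r                  <- 0.457354
r_squared_reported <- 0.2092
p_value            <- 0.03711
alpha              <- 0.05

# With a single predictor, R-squared is the squared correlation.
r_squared_from_r <- round(r^2, 4)
print(r_squared_from_r)

# TRUE means the result is significant at the 5% level.
significant <- p_value < alpha
print(significant)
```

The squared correlation matches the Multiple R-squared in the summary, which is a quick way to confirm that both outputs describe the same relationship.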

Final Visualization

Let’s plot my final visualization. I wanted to explore the relationship between the budget and revenue. However, I needed to turn this into a single variable, so I created a new column called NetProfit. I created this by subtracting the budget from the revenue to find the profit of each movie. I’m using the dataset without the outliers so we can fully see the other movies.

ghibli4 <- ghibli3 |>
  mutate("NetProfit" = Revenue - Budget)

Here I’m grouping by the Genre and finding the maximum profit within each genre.

ghibli4 <- ghibli4 |>
  group_by(`Genre 1`) |>
  filter(NetProfit == max(NetProfit))
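For readers who want to see the "keep the top row per group" idea in isolation, here is a base-R sketch of the same logic on made-up data. The `films` table is hypothetical, and ave() stands in for the group_by() plus filter() combination used above.

```r
# Made-up data for illustration only.
films <- data.frame(
  Name      = c("A", "B", "C", "D", "E"),
  Genre     = c("Drama", "Drama", "Fantasy", "Fantasy", "Comedy"),
  NetProfit = c(100, 250, 900, 400, 300),
  stringsAsFactors = FALSE
)

# For every row, compute the maximum NetProfit within its genre.
group_max <- ave(films$NetProfit, films$Genre, FUN = max)

# Keep only rows that equal their genre's maximum, then sort ascending.
top_per_genre <- films[films$NetProfit == group_max, ]
top_per_genre <- top_per_genre[order(top_per_genre$NetProfit), ]

print(top_per_genre)
```

One thing to keep in mind with this approach, in either base R or dplyr: if two films in a genre tie for the maximum, both rows are kept.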

Finally, I’m arranging the movies in ascending order of net profit.

ghibli4 <- ghibli4 |>
  arrange(NetProfit)
ghibli4

## # A tibble: 6 × 11
## # Groups:   Genre 1 [6]
##   Name     Year Director Screenplay Budget Revenue `Genre 1` `Genre 2` `Genre 3`
##   <chr>   <dbl> <chr>    <chr>       <dbl>   <dbl> <chr>     <chr>     <chr>    
## 1 Porco …  1992 Hayao M… Hayao Miy…  58880  218240 Family    Comedy    Animation
## 2 From U…  2011 Goro Mi… Hayao Miy… 140800  390642 Romance   Drama     <NA>     
## 3 The Wi…  2013 Hayao M… Tatsuo Ho… 192000  754767 Drama     Animation Romance  
## 4 Prince…  1997 Hayao M… Hayao Miy… 150400 1081600 Adventure Fantasy   Animation
## 5 My Nei…  1999 Isao Ta… Isao Taka… 131200 1068800 Comedy    Family    <NA>     
## 6 Spirit…  2001 Hayao M… Hayao Miy… 121600 1759521 Fantasy   Family    Fantasy  
## # ℹ 2 more variables: Duration <chr>, NetProfit <dbl>
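Since the goal is cost efficiency, it may also be worth looking at revenue relative to budget, not just the difference between them. The sketch below rebuilds the six rows from the Budget and Revenue values printed above and adds a revenue-to-budget ratio. The `Ratio` column and the `top6` name are additions for illustration and are not part of the original analysis.

```r
# Budget and Revenue values copied from the printed table above.
top6 <- data.frame(
  Name    = c("Porco Rosso", "From Up on Poppy Hill", "The Wind Rises",
              "Princess Mononoke", "My Neighbors the Yamadas", "Spirited Away"),
  Budget  = c(58880, 140800, 192000, 150400, 131200, 121600),
  Revenue = c(218240, 390642, 754767, 1081600, 1068800, 1759521),
  stringsAsFactors = FALSE
)

# Net profit as defined in this project, plus revenue earned per dollar spent.
top6$NetProfit <- top6$Revenue - top6$Budget
top6$Ratio     <- round(top6$Revenue / top6$Budget, 2)

print(top6[order(top6$Ratio), c("Name", "NetProfit", "Ratio")])
```

By this measure Spirited Away still comes out on top, although the order of the remaining films is not identical to the order by net profit.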

Here I created a custom color vector from a palette I found online.

desiredcolors <- c("#D2C98A", "#FD9584", "#A66D45", "#F6B6A7", "#F3EAD6", "#A7A155")

I’m calling in library(highcharter) because this package adds an interesting feature to my visualization: tooltips. I begin with the hchart() function, passing in my most recent dataset, ghibli4, choosing the chart type (a column chart), and mapping the Name of the movie to the x-axis, the profit to the y-axis, and the color to the color vector I created above.

Then I add my background image, followed by my title and axis labels. For the tooltips, I use the {point.variablename} placeholder to pull each value from the variable needed. I do that for every variable I intend to show in my visualization. Finally, I set the font to Spectral again and add a caption crediting the source of the image.

library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

plot <- hchart(object = ghibli4,
               type = "column",
               hcaes(x = Name, y = NetProfit, color = desiredcolors)) |>
  hc_chart(backgroundColor = "transparent", 
           divBackgroundImage = "https://i.pinimg.com/736x/72/1f/92/721f925b073c1e8aaa01af263be15f29.jpg") |>
  hc_title(text = "Studio Ghibli Top 6 Profits by Genre") |>
  hc_xAxis(title = list(text = "Production Name")) |>
  hc_yAxis(title = list(text = "Net Profit (USD)")) |>
  hc_tooltip(shared = FALSE, pointFormat = "Year: {point.Year}<br> Director: {point.Director}<br> Budget: ${point.Budget}<br> Revenue: ${point.Revenue}<br> Genre: {point.Genre 1}<br> Run Time: {point.Duration}<br> Profit: ${point.NetProfit}") |>
  hc_chart(style = list(fontFamily = "Spectral",
                        fontWeight = "bold")) |>
  hc_caption(text = "My Neighbor Totoro")
plot

Conclusion

Throughout my project, I found that there is only a weak positive correlation between budget and revenue (r ≈ 0.46), so budget alone is a poor predictor of how a film performs. Some productions had a low budget and did extremely well, while others had a high budget but flopped. Without the 2 outliers (Pom Poko and When Marnie Was There), Spirited Away is the movie with the highest profit.

Based on the 6 genres given to us (comedy, drama, adventure, fantasy, romance, and family), Spirited Away, My Neighbors the Yamadas, Princess Mononoke, The Wind Rises, From Up on Poppy Hill, and Porco Rosso were the most profitable movies in their genres, pairing low budgets with higher revenues and making more money than was spent on production.

Another way to explore this dataset would be to look at the movies that didn’t do well, meaning those that had high budgets and low revenue.

References

Dataset: https://www.kaggle.com/datasets/shruthiiiee/studio-ghibli-dataset

Studio Ghibli Description:
https://ghiblicollection.com/pages/about-us
https://chat.openai.com/share/f8c16cc0-be11-4ad1-8146-2c0c060f24cf

Coding:
https://www.rdocumentation.org/packages/magick/versions/2.8.3/topics/image_ggplot
https://gradientdescending.com/adding-custom-fonts-to-ggplot-in-r/

DATA101 Final Project

Aashka Navale

2024-04-23