Studio Ghibli is a Japanese animation film studio renowned for the iconic and stunning art of its animated films. It was founded in 1985 by animated film directors Isao Takahata and Hayao Miyazaki. Together, they have produced twenty-three films ranked number one at the box office in Japan in the year in which they were released. Spirited Away is one of the highest-grossing films in Japan, collecting a total revenue of over ¥30 billion, or over $190 million. The studio is consistently praised for its distinctive animation, simple art style, rich storytelling, and captivating characters and emotions (Ghibli Collection & ChatGPT).
For this project, I will be using nearly all of the variables from my Studio Ghibli dataset: the film Name, the Year it was produced, the Director of the film, the person credited with the Screenplay, the Budget, the Revenue, the Genre, and the Duration. My plan is to explore the six Studio Ghibli films that were the most cost efficient, meaning the films that earned the most revenue relative to their budget.
Dataset Source: @shruthiiiee on Kaggle
First things first, I’m going to load my first library. I will be using more, but I’ll load them later. Then I set my working directory to the folder where all of the files for my project are stored. Finally, I read in my dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/aashkanavale/Desktop/Montgomery College/MC Spring '24/DATA101/projects/Final Project - Studio Ghibli")
ghibli <- read_csv("Studio Ghibli.csv")
## Rows: 23 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Name, Director, Screenplay, Budget, Revenue, Genre 1, Genre 2, Genr...
## dbl (1): Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ghibli
## # A tibble: 23 × 10
## Name Year Director Screenplay Budget Revenue `Genre 1` `Genre 2` `Genre 3`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "When… 2014 Hiromas… Joan G. R… $1150… $34949… Animation Drama <NA>
## 2 "The … 2013 Isao Ta… Riko Saka… $4930… $24366… Animation Drama Fantasy
## 3 "The … 2013 Hayao M… Tatsuo Ho… $3000… $11793… Drama Animation Romance
## 4 "From… 2011 Goro Mi… Hayao Miy… $2200… $61037… Animation Drama <NA>
## 5 "The … 2010 Hiromas… Mary Nort… $2300… $14948… Fantasy Animation Family
## 6 "Pony… 2008 Hayao M… <NA> $3400… $20240… Animation Fantasy Family
## 7 "Ocea… 1994 Tomomi … Saeko Him… $5000… $10000… Animation Drama Romance
## 8 "Tale… 2006 Goro Mi… Ursula K.… $2200… $68625… Animation Fantasy Adventure
## 9 "Only… 1991 Isao Ta… <NA> $2500… $47311… Animation Drama Romance
## 10 "Spir… 2001 Hayao M… <NA> $1900… $27492… Animation Family Fantasy
## # ℹ 13 more rows
## # ℹ 1 more variable: Duration <chr>
There is quite a bit to clean up. The name of each movie also includes the year, even though there is already a column for the year, so we have to fix that. Some screenplay names are missing, but a quick Google search can fill those in. The budgets and revenues have a dollar sign in front of them, so they are stored as text rather than numbers and can’t be used in calculations yet. Many of the movies list “Animation” as their first genre, which tells us nothing because every film is animated, so we have to fix that too.
Let’s check the classes of my numerical variables first.
class(ghibli$Year)
## [1] "numeric"
class(ghibli$Budget)
## [1] "character"
class(ghibli$Revenue)
## [1] "character"
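Checking one column at a time works, but the same check can cover every column at once. The sketch below is only an illustration: `films` is a made-up data frame whose column names mirror the dataset, and the values are invented.

```r
# Toy data frame with invented values; only the column names mirror the dataset.
films <- data.frame(
  Year    = c(2001, 2008),
  Budget  = c("$100", "$200"),
  Revenue = c("$300", "$400"),
  stringsAsFactors = FALSE
)

# sapply() applies class() to each column and returns a named character vector.
classes <- sapply(films, class)
classes

# Names of the columns that still need converting to numeric.
names(classes)[classes == "character"]
```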
Looks like year is numeric, but the budget and revenue aren’t. We’ll get to fixing that, but first, let’s begin by recoding the names of the movies so they no longer include the year.
ghibli2 <- ghibli
ghibli2$Name <- c("When Marnie Was There", "The Tale of The Princess Kaguya", "The Wind Rises", "From Up on Poppy Hill",
"The Secret World of Arrietty", "Ponyo", "Ocean Waves", "Tales from Earthsea", "Only Yesterday",
"Spirited Away", "My Neighbors the Yamadas", "Whisper of the Heart", "Grave of the Fireflies",
"My Neighbor Totoro", "Princess Mononoke", "Howl's Moving Castle", "Castle in the Sky", "Kiki's Delivery Service",
"Pom Poko", "Porco Rosso", "The Cat Returns", "Nausicaä of the Valley of the Wind", "The Boy and the Heron")
Now I’m going to recode the screenplay names by repeating the same process as above.
ghibli2$Screenplay <- c("Joan G. Robinson", "Riko Sakaguchi", "Tatsuo Hori", "Hayao Miyazaki", "Mary Norton", "Melissa Mathison",
"Saeko Himuro", "Ursula K. Le Guin", "Isao Takahata", "Hayao Miyazaki", "Isao Takahata", "Hayao Miyazaki",
"Akiyuki Nosaka", "Hayao Miyazaki", "Hayao Miyazaki", "Diana Wynne Jones", "John Semper", "Eiko Kadono",
"Isao Takahata", "Hayao Miyazaki", "Reiko Yoshida", "Kazunori Ito", "Hayao Miyazaki")
For this section, I’m removing the dollar signs from the budget and revenue variables by using the gsub() function…
ghibli2$Budget <- gsub("\\$", "", ghibli2$Budget)
ghibli2$Revenue <- gsub("\\$", "", ghibli2$Revenue)
# Source: ChatGPT
… so that now, I can convert the variables into numeric.
ghibli2$Budget <- as.numeric(ghibli2$Budget)
ghibli2$Revenue <- as.numeric(ghibli2$Revenue)
class(ghibli2$Budget)
## [1] "numeric"
class(ghibli2$Revenue)
## [1] "numeric"
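One thing to watch for with this approach is that as.numeric() quietly returns NA for any string it cannot parse, for example one that still contains a comma. The sketch below is a slightly more defensive version of the same idea, not what I ran above: `raw_budget` is a made-up vector, and the pattern removes every character that is not a digit or a decimal point.

```r
# Made-up currency strings, including a comma and a missing value.
raw_budget <- c("$19000000", "$3,400,000", NA)

# Drop everything that is not a digit or a decimal point, then convert.
budget <- as.numeric(gsub("[^0-9.]", "", raw_budget))

# Stop if any value that was present in the input failed to convert.
stopifnot(!any(is.na(budget) & !is.na(raw_budget)))

budget
```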
Upon further inspection, I noticed that the budget and revenue variables were in Japanese yen and not US dollars, so I multiplied both variables by 0.0064, an approximate yen-to-dollar conversion rate.
ghibli2$Budget <- ghibli2$Budget * 0.0064
ghibli2$Revenue <- ghibli2$Revenue * 0.0064
Since I don’t want any decimal places, I’m just going to round the revenue to the nearest dollar using the round() function.
ghibli2$Revenue <- round(ghibli2$Revenue)
Here I’m fixing the genres. I looked up the genre for each production where it states “Animation” so that there is a genre for each movie and repeated the same process I used above.
ghibli2$`Genre 1` <- c("Drama", "Fantasy", "Drama", "Romance",
"Fantasy", "Fantasy", "Drama", "Fantasy",
"Drama", "Fantasy", "Comedy", "Romance",
"Drama", "Fantasy", "Adventure", "Fantasy",
"Adventure", "Fantasy", "Adventure", "Family",
"Adventure", "Adventure", "Fantasy")
ghibli2
## # A tibble: 23 × 10
## Name Year Director Screenplay Budget Revenue `Genre 1` `Genre 2` `Genre 3`
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 When … 2014 Hiromas… Joan G. R… 7.36e6 223677 Drama Drama <NA>
## 2 The T… 2013 Isao Ta… Riko Saka… 3.16e5 155947 Fantasy Drama Fantasy
## 3 The W… 2013 Hayao M… Tatsuo Ho… 1.92e5 754767 Drama Animation Romance
## 4 From … 2011 Goro Mi… Hayao Miy… 1.41e5 390642 Romance Drama <NA>
## 5 The S… 2010 Hiromas… Mary Nort… 1.47e5 956675 Fantasy Animation Family
## 6 Ponyo 2008 Hayao M… Melissa M… 2.18e5 1295386 Fantasy Fantasy Family
## 7 Ocean… 1994 Tomomi … Saeko Him… 3.20e4 640 Drama Drama Romance
## 8 Tales… 2006 Goro Mi… Ursula K.… 1.41e5 439201 Fantasy Fantasy Adventure
## 9 Only … 1991 Isao Ta… Isao Taka… 1.6 e4 3028 Drama Drama Romance
## 10 Spiri… 2001 Hayao M… Hayao Miy… 1.22e5 1759521 Fantasy Family Fantasy
## # ℹ 13 more rows
## # ℹ 1 more variable: Duration <chr>
We’re done cleaning up!
Now that the dataset is ready to use, I want to fit a linear regression before I plot my final visualization. I start with the cleaned-up dataset, ghibli2, and use ggplot() with aes() to set my x-axis as the budget and my y-axis as the revenue. Then I use the geom_point() function to actually plot the points. Finally, I use geom_smooth() to add in my regression line.
regression <- ghibli2 |>
ggplot(aes(x = Budget, y = Revenue)) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x, se = FALSE)
regression
It seems like we have 2 massive outliers, Pom Poko (top left) and When Marnie Was There (bottom right), that are preventing us from properly viewing the rest of the points. I’m going to filter for a budget less than 2,000,000 and a revenue less than 10,000,000.
ghibli3 <- ghibli2 |>
filter(Budget < 2000000) |>
filter(Revenue < 10000000)
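Before dropping rows, it can help to look at exactly which films the thresholds remove. The sketch below shows one way to do that in base R. The `films` data frame and its values are invented for illustration; only the two cutoffs come from the filter above.

```r
# Toy data with invented values; only the column names mirror ghibli2.
films <- data.frame(
  Name    = c("Film A", "Film B", "Film C", "Film D"),
  Budget  = c(120000, 7400000, 150000, 90000),
  Revenue = c(1700000, 220000, 12000000, 400000)
)

# One logical vector marks the rows that pass both thresholds.
keep <- films$Budget < 2000000 & films$Revenue < 10000000

dropped  <- films[!keep, ]  # rows treated as outliers
filtered <- films[keep, ]   # rows kept for plotting

# Inspect what is being removed before committing to the filter.
dropped$Name
```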
To make the visualization more appealing, I’m going to add an image behind the graph. Here I’m loading library(magick) and library(raster) because these are the two packages that will help me achieve what I’m planning to do. I use the image_read() function to read in the image from my project folder. Then I use the image_fill() and as.raster() functions to turn it into something the plot can draw.
I found this information on the RDocumentation website.
library(magick)
## Linking to ImageMagick 6.9.12.93
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11
library(raster)
## Loading required package: sp
##
## Attaching package: 'raster'
## The following object is masked from 'package:dplyr':
##
## select
image <- image_read("/Users/aashkanavale/Desktop/Montgomery College/MC Spring '24/DATA101/projects/Final Project - Studio Ghibli/transparent.png")
image <- image_fill(image, 'none')
image <- as.raster(image)
# Source: RDocumentation
I’m also going to change the font to one of my favorite fonts: Spectral. I bring in library(showtext) that will help me access Google fonts.
I found this information from Daniel Oehm.
library(showtext)
## Loading required package: sysfonts
## Loading required package: showtextdb
font_add_google(name = "Spectral", family = "serif-face")
showtext_auto()
# Source: Daniel Oehm, Gradient Descending
To actually plot this now, I’m calling my dataset without the outliers and setting the same axes. I use the annotation_raster() function to bring in my background image. I use geom_point() and set the color to a complementary color. I use the same geom_smooth() function as before, but this time I adjust the linetype and the color.
Then I adjust the labels and add in a title and the units to my axes so it’s more clear. I also add in a caption to source the movie the background image is from and the source of the data. I adjust the theme to a non-default theme and use the theme(text = element_text()) function to bring in the font I called earlier.
regression2 <- ghibli3 |>
ggplot(aes(x = Budget, y = Revenue)) +
annotation_raster(image, -Inf, Inf, -Inf, Inf, interpolate = FALSE) + # Source: RDocumentation
geom_point(color = "#1a3e6d") +
geom_smooth(method = 'lm', formula = y~x, se = FALSE, linetype = "twodash", color = "forestgreen") +
labs(title = "Studio Ghibli Budget vs. Revenue Relationship",
x = "Budget (USD)",
y = "Revenue (USD)",
caption = "Howl's Moving Castle\nData Source: @shruthiiiee on Kaggle") +
theme_classic() +
theme(text = element_text(family = "serif-face"))
regression2
To actually see the correlation, I’m using the cor() function and then using the lm() function to get all of the necessary data I need to make a conclusion.
cor(ghibli3$Revenue, ghibli3$Budget)
## [1] 0.457354
fit <- lm(Budget ~ Revenue, data = ghibli3)
summary(fit)
##
## Call:
## lm(formula = Budget ~ Revenue, data = ghibli3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149107 -51333 -34726 21256 449398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.665e+04 4.016e+04 1.660 0.1134
## Revenue 1.160e-01 5.173e-02 2.242 0.0371 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 129000 on 19 degrees of freedom
## Multiple R-squared: 0.2092, Adjusted R-squared: 0.1676
## F-statistic: 5.025 on 1 and 19 DF, p-value: 0.03711
The cor() function computes the correlation coefficient, which is always between -1 and 1 and tells us how strong or weak the linear relationship is. Values close to positive or negative 1 show a strong correlation (the sign matches the sign of the linear slope, and in this case the slope is positive), values close to positive or negative 0.5 show a weak-to-moderate correlation, and values close to zero show no correlation.
Because my value is 0.457354, it’s above 0 and just under 0.5, meaning there is a positive correlation, but a fairly weak one.
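The correlation and the regression summary are two views of the same relationship: for a linear model with a single predictor, the R-squared value is the square of the correlation coefficient. The sketch below checks that on a small made-up dataset (the `budget` and `revenue` vectors are invented) and then on the numbers reported above.

```r
# Toy data with invented values, used only to illustrate the relationship.
budget  <- c(1, 2, 3, 4, 5, 6)
revenue <- c(2, 1, 4, 3, 7, 5)

r   <- cor(revenue, budget)
fit <- lm(revenue ~ budget)

# For a one-predictor linear model, R-squared equals the squared correlation.
all.equal(r^2, summary(fit)$r.squared)

# The same check with the correlation reported above: 0.457354 squared
# rounds to 0.2092, the Multiple R-squared in the summary output.
round(0.457354^2, 4)
```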
For a linear regression, the equation (y = mx + b) must be used.
The equation for my model is: Budget = 0.11597(Revenue) + 66651.64252
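How do we interpret the equation? For every additional dollar of revenue, the model predicts that the budget is higher by about 0.11597 dollars, or roughly 12 cents. The intercept, 66651.64252, is the predicted budget for a film with zero revenue.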
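To make the interpretation concrete, the equation can be turned into a small function. This is only a sketch: `predict_budget` is a hypothetical helper name, the coefficients are copied from the equation above, and the revenue figures passed in are arbitrary examples.

```r
# Coefficients copied from the fitted equation above.
intercept <- 66651.64252
slope     <- 0.11597

# Hypothetical helper: predicted budget for a given revenue.
predict_budget <- function(revenue) intercept + slope * revenue

# Predicted budget for a film with 1,000,000 in revenue:
# 66651.64252 + 0.11597 * 1000000 = 182621.64252
predict_budget(1000000)

# An extra 1,000,000 in revenue raises the prediction by 115,970.
predict_budget(2000000) - predict_budget(1000000)
```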
To check if the results are significant, we must look at the p-value. The significance level is typically 0.05, or 5%. My p-value is 0.03711, which is below 0.05, so the relationship between revenue and budget is statistically significant at the 5% level. That does not make the relationship strong, though: the R-squared of 0.2092 means the model explains only about 21% of the variation in budget, which matches the weak positive correlation found earlier.
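The p-value in the summary comes from the t value and the residual degrees of freedom, and it can be reproduced by hand with pt(). The sketch below uses the rounded t value printed in the summary, so the result should only match the reported 0.0371 approximately.

```r
# t value and residual degrees of freedom as printed in the summary above.
t_value <- 2.242
df      <- 19

# Two-sided p-value: probability of a t statistic at least this extreme
# in either direction if the true slope were zero.
p_value <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)
p_value

# Compare against the usual 5% significance level.
p_value < 0.05
```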
Let’s plot my final visualization. I wanted to explore the relationship between the budget and revenue. To do that, I needed to combine them into a single variable, so I created a new column called NetProfit by subtracting the budget from the revenue, which gives the profit of each movie. I’m using the dataset without the outliers so we can fully see the other movies.
ghibli4 <- ghibli3 |>
mutate("NetProfit" = Revenue - Budget)
Here I’m grouping by genre and keeping the movie with the maximum profit in each genre.
ghibli4 <- ghibli4 |>
group_by(`Genre 1`) |>
filter(NetProfit == max(NetProfit))
Finally, I’m arranging the movies in ascending order of profit.
ghibli4 <- ghibli4 |>
arrange(NetProfit)
ghibli4
## # A tibble: 6 × 11
## # Groups: Genre 1 [6]
## Name Year Director Screenplay Budget Revenue `Genre 1` `Genre 2` `Genre 3`
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Porco … 1992 Hayao M… Hayao Miy… 58880 218240 Family Comedy Animation
## 2 From U… 2011 Goro Mi… Hayao Miy… 140800 390642 Romance Drama <NA>
## 3 The Wi… 2013 Hayao M… Tatsuo Ho… 192000 754767 Drama Animation Romance
## 4 Prince… 1997 Hayao M… Hayao Miy… 150400 1081600 Adventure Fantasy Animation
## 5 My Nei… 1999 Isao Ta… Isao Taka… 131200 1068800 Comedy Family <NA>
## 6 Spirit… 2001 Hayao M… Hayao Miy… 121600 1759521 Fantasy Family Fantasy
## # ℹ 2 more variables: Duration <chr>, NetProfit <dbl>
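The group-then-filter step above keeps one row per genre. For anyone who wants to see the same logic without the pipe, here is a base R sketch. The `films` data frame is invented for illustration, and it uses a plain `Genre` column name instead of `Genre 1`.

```r
# Toy data with invented values; NetProfit mirrors the column created above.
films <- data.frame(
  Name      = c("Film A", "Film B", "Film C", "Film D"),
  Genre     = c("Fantasy", "Fantasy", "Drama", "Drama"),
  NetProfit = c(500, 900, 300, 100)
)

# ave() returns, for every row, the maximum NetProfit within that row's genre.
group_max <- ave(films$NetProfit, films$Genre, FUN = max)

# Keep the top film per genre, then sort from lowest to highest profit.
top <- films[films$NetProfit == group_max, ]
top <- top[order(top$NetProfit), ]

top$Name
```

Note that if two films in the same genre tie for the highest profit, both this sketch and the filter above keep both rows.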
Here I created a custom color vector from a palette I found online.
desiredcolors <- c("#D2C98A", "#FD9584", "#A66D45", "#F6B6A7", "#F3EAD6", "#A7A155")
I’m loading library(highcharter) because this is the package that adds an interactive element to my visualization: tooltips. I begin with the hchart() function, passing in my most recent dataset, ghibli4, choosing the chart type, which is a column chart (a vertical bar graph), and mapping the Name of the movie to the x-axis, the profit to the y-axis, and the color to the color vector I created above.
Then I add my background image, my title, and my axis labels. For the tooltips, I use the {point.variablename} placeholder to pull in each value from the variable needed. I do that for every variable I intend to show in my visualization. Finally, I adjust the font to Spectral again and add a caption for the source of the image.
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
plot <- hchart(object = ghibli4,
type = "column",
hcaes(x = Name, y = NetProfit, color = desiredcolors)) |>
hc_chart(backgroundColor = "transparent",
divBackgroundImage = "https://i.pinimg.com/736x/72/1f/92/721f925b073c1e8aaa01af263be15f29.jpg") |>
hc_title(text = "Studio Ghibli Top 6 Profits by Genre") |>
hc_xAxis(title = list(text = "Production Name")) |>
hc_yAxis(title = list(text = "Net Profit (USD)")) |>
hc_tooltip(shared = F, pointFormat = "Year: {point.Year}<br> Director: {point.Director}<br> Budget: ${point.Budget}<br> Revenue: ${point.Revenue}<br> Genre: {point.Genre 1}<br> Run Time: {point.Duration}<br> Profit: ${point.NetProfit}") |>
hc_chart(style = list(fontFamily = "Spectral",
fontWeight = "bold")) |>
hc_caption(text = "My Neighbor Totoro")
plot
Throughout my project, I found that there is only a weak positive correlation between budget and revenue. There are some productions that had a low budget and did extremely well and others that had a high budget but flopped. Without the 2 outliers (Pom Poko and When Marnie Was There), Spirited Away is the movie with the highest profit.
Based on the 6 genres in the data (comedy, drama, adventure, fantasy, romance, and family), Spirited Away, My Neighbors the Yamadas, Princess Mononoke, The Wind Rises, From Up on Poppy Hill, and Porco Rosso were the most cost-efficient movies in their genres: they had low budgets and higher revenues, making more money than was spent on production.
Another way to explore this dataset would be to look at the movies that didn’t do well, meaning they had high budgets and low revenue.
Dataset: https://www.kaggle.com/datasets/shruthiiiee/studio-ghibli-dataset
Studio Ghibli Description:
https://ghiblicollection.com/pages/about-us
https://chat.openai.com/share/f8c16cc0-be11-4ad1-8146-2c0c060f24cf
Coding:
https://www.rdocumentation.org/packages/magick/versions/2.8.3/topics/image_ggplot
https://gradientdescending.com/adding-custom-fonts-to-ggplot-in-r/