Research Question

We will try to predict the rating of anime based upon its type, its number of votes, and its title length. We will also see which variable is the most important in predicting the anime rating.

The reasons behind the variable choice:

We wonder if these differences make some types of anime get better ratings than others.

We wonder if how popular among voters these anime need to be to get a high rating.

As we also spotted that many anime titles in the data frame we would use are fully in Japanese, we wonder if the length of these anime titles affect the voters’ ability to recognize the anime, therefore affect the anime rating.

Method used to approach

Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.0.10     ✔ readr     2.1.4 
## ✔ forcats   1.0.0      ✔ stringr   1.5.0 
## ✔ ggplot2   3.4.2      ✔ tibble    3.2.1 
## ✔ lubridate 1.9.2      ✔ tidyr     1.2.1 
## ✔ purrr     1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.4     ✔ rsample      1.1.1
## ✔ dials        1.0.0     ✔ tune         1.0.0
## ✔ infer        1.0.3     ✔ workflows    1.1.0
## ✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
## ✔ parsnip      1.0.2     ✔ yardstick    1.1.0
## ✔ recipes      1.0.1     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(dplyr)
library(stringr)
library(vip)
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Data

We will use the data frame called anime_top_250 scraped from the Anime News Network website’s Anime Top 250 Best Rated (bayesian estimate) table.

Note that after we omit rows with “NA” values, our data frame is left with 242 anime in the top 250.

library(readr)
anime_top_250 <- read_csv("anime_top_250.csv")
## Rows: 242 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): title, type
## dbl (3): rank, rating, vote
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(anime_top_250)
## Rows: 242
## Columns: 5
## $ rank   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ title  <chr> "Fullmetal Alchemist: Brotherhood", "Steins;Gate", "Clannad Aft…
## $ type   <chr> "TV", "TV", "TV", "movie", "OAV", "TV", "movie", "TV", "TV", "m…
## $ rating <dbl> 9.09, 9.04, 9.04, 9.02, 8.97, 8.93, 8.92, 8.89, 8.89, 8.86, 8.8…
## $ vote   <dbl> 5868, 4702, 5198, 1166, 6660, 7898, 10584, 12302, 532, 9873, 28…

To answer our above-mentioned research question, we will introduce another variable named “title_length”, which takes the number of characters in the anime titles.

Note that we chose to count the title length based upon the number of characters instead of words because there are many Japanese phrases, when translated into latin characters, are written as a single word.

For example: The anime title “Kizumonogatari” (one of the anime included in the anime_top_250 data frame) is actually a phrase consisting of three words in Japanese instead of one as it appears in latin characters.

There is also only one anime with the type “ONA” (original net animation), so we will remove this anime from the list and only investigate the other four types: movie, TV, OAV (original video animation), and special.

#Add a new variable named title_length to the data frame
anime_top_250$title_length <- str_length(anime_top_250$title)
#Remove the only ONA type anime
anime_top_250 <- anime_top_250[!anime_top_250$type == "ONA", ]
glimpse(anime_top_250)
## Rows: 241
## Columns: 6
## $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ title        <chr> "Fullmetal Alchemist: Brotherhood", "Steins;Gate", "Clann…
## $ type         <chr> "TV", "TV", "TV", "movie", "OAV", "TV", "movie", "TV", "T…
## $ rating       <dbl> 9.09, 9.04, 9.04, 9.02, 8.97, 8.93, 8.92, 8.89, 8.89, 8.8…
## $ vote         <dbl> 5868, 4702, 5198, 1166, 6660, 7898, 10584, 12302, 532, 98…
## $ title_length <int> 32, 11, 19, 10, 33, 39, 13, 12, 26, 17, 38, 35, 9, 14, 10…

Histogram for the “rating” variable

p <- ggplot(data = anime_top_250) +
  geom_histogram(aes(x=rating), bins = 30) +
  ggtitle("Ratings of Top 250 Anime By Anime News Network") +
  xlab("Rating (Over 10)")

p

We can see that the rating of anime ranges from approximately 8.1 to 9.1. The distribution of rating is right skewed, and uni-modal, with the middle 50% of the anime receiving a rating roughly in between 8.25 and 8.5, and anime rating peaks at 8.3 with about 25 anime with this rating. This rating range is pretty high, and is generally what we would expect since these are the top 250 best anime.

5-fold Cross-validation

We apply cross validation ### Split the anime_top_250 data into training (80%), and testing (20%)

set.seed(1234)

anime_top_250_split <- initial_split(anime_top_250, prop = 0.80)

anime_top_250_train <- training(anime_top_250_split)
anime_top_250_test <- testing(anime_top_250_split)

Specify a linear regression model called anime_top_250_model for our data

anime_top_250_model <- linear_reg() %>%
  set_engine("lm")
anime_top_250_model
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Create a recipe which update title to not to be a predictor, and remove all zero variance predictors

anime_top_250_recipe <- recipe(rating ~ ., data = anime_top_250_train) %>%
  update_role(title, new_role = "id") %>%
  step_dummy(all_nominal(), -title) %>%
  step_zv(all_predictors())
anime_top_250_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##         id          1
##    outcome          1
##  predictor          4
## 
## Operations:
## 
## Dummy variables from all_nominal(), -title
## Zero variance filter on all_predictors()

Build a workflow named anime_top_250_wf for fitting the anime_top_250_model specified earlier and using the anime_top_250_recipe developed to preprocess our data

anime_top_250_wf <- workflow() %>% 
  add_model(anime_top_250_model) %>%
  add_recipe(anime_top_250_recipe)
anime_top_250_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Fit the model anime_top_250_fit to training data using the anime_top_250_wf workflow created.

anime_top_250_fit <- anime_top_250_wf%>%
  fit(data = anime_top_250_train)
anime_top_250_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##  (Intercept)          rank          vote  title_length      type_OAV  
##    8.704e+00    -2.540e-03     7.142e-06     1.149e-04     1.025e-02  
## type_special       type_TV  
##   -2.502e-02    -2.280e-02

The slope corresponding to the title length is \(1.15 \times 10^{-4}\) meaning that for every additional character in the anime title, the rating seems to increase by \(1.15 \times 10^{-4}\).

The slope corresponding to the number of vote is \(7.14 \times 10^{-6}\) suggesting that for every additional vote for the anime, its rating seems to increase by \(7.14 \times 10^{-6}\).

Split the training data into 5 folds using vfold_cv

set.seed(1234)
anime_top_250_folds <- vfold_cv(anime_top_250_train, v = 5)
anime_top_250_folds

Apply the workflow defined earlier to the folds with fit_resamples()

set.seed(1234)
anime_top_250_fit_rs <- anime_top_250_wf %>%
  fit_resamples(anime_top_250_folds)
anime_top_250_fit_rs

Report RMSE and R2 all cross validation folds

collect_metrics(anime_top_250_fit_rs)

The value of RMSE of 0.068 indicates that the model is pretty ideal in predicting the data accurately. The \(R^2\) value of 88% means that 88% of the rating could be explained by the other variables. Considering that we have a fan-based anime data frame, I think that we can proceed with this value.

Make predictions for the testing data and calculate the RMSE

anime_top_250_test_pred <- predict(anime_top_250_fit, new_data = anime_top_250_test) %>%
  bind_cols(anime_top_250_test %>% select(rating, title))
rmse(anime_top_250_test_pred, truth = rating, estimate = .pred)

The RMSE value of 0.09 for testing data is not too high compared to our RMSE for cross-validation folds above, meaning that we did not over-fit the data.

Regression Decision Tree

Specify a decision tree anime_top_250_tree_spec with rpart engine

anime_top_250_tree_spec <- decision_tree() %>%
  set_engine("rpart")
anime_top_250_tree_spec
## Decision Tree Model Specification (unknown mode)
## 
## Computational engine: rpart

Specify a decision tree anime_top_250_reg_tree_spec with regression engine

anime_top_250_reg_tree_spec <- anime_top_250_tree_spec %>%
  set_mode("regression")
anime_top_250_reg_tree_spec
## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart

Specify a workflow anime_top_250_reg_tree_wf

anime_top_250_reg_tree_wf <- workflow() %>%
  add_model(anime_top_250_reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_formula(rating ~ . )
anime_top_250_reg_tree_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## rating ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (regression)
## 
## Main Arguments:
##   cost_complexity = tune()
## 
## Computational engine: rpart

Fit a decision tree model named anime_top_250_reg_tree_fit with title_length, type, vote as predictors using training data.

anime_top_250_reg_tree_fit <- fit(anime_top_250_reg_tree_spec, rating ~ title_length + type + vote, data = anime_top_250_train)
anime_top_250_reg_tree_fit
## parsnip model object
## 
## n= 192 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 192 7.2771920 8.384792  
##    2) vote< 6621.5 180 5.6960550 8.367167  
##      4) vote< 526 64 0.9876937 8.307813  
##        8) title_length>=11 57 0.5860842 8.290526 *
##        9) title_length< 11 7 0.2458857 8.448571 *
##      5) vote>=526 116 4.3584990 8.399914  
##       10) type=special,TV 84 2.5338670 8.376667  
##         20) title_length< 16.5 48 1.5177480 8.342708 *
##         21) title_length>=16.5 36 0.8869639 8.421944 *
##       11) type=movie,OAV 32 1.6600720 8.460937  
##         22) vote>=1266 21 0.7757143 8.434286  
##           44) vote< 2001 7 0.1180857 8.328571 *
##           45) vote>=2001 14 0.5402857 8.487143 *
##         23) vote< 1266 11 0.8409636 8.511818 *
##    3) vote>=6621.5 12 0.6864917 8.649167 *

Final fitted regression decision tree

library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
anime_top_250_reg_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE, digits = 3)

We can see that there are 6.2% of the anime (about 14 anime) with more than 6622 votes and they resulted in a predicted rating of 8.65, which is very high given that we know the middle 50% of anime get a rating of about between 8.25 and 8.5. The anime type and title length are not important for these anime to get a high rating.

Some other combinations of anime that are predicted to have a pretty good rating are:

  1. Anime of either movie or OAV type with between 526 and 1266 votes are highly rated with an average predicted rating of 8.51.

  2. Anime of either movie or OAV type with between 2001 and 6622 votes (about 17 anime) also get a high average predicted rating of 8.49.

  3. Anime with a fewer than 11 character title and fewer than 526 votes get a pretty good rating as well, of 8.45.

  4. Anime of either or special, or TV type with a title length more than 16.5 and receiving between 526 and 6622 votes have fair ratings with an average predicted rating of 8.42.

There are 30% of the anime which have a title length greater than 11, and fewer than 526 votes. These anime end up with the lowest average predicted rating of only 8.29. The type of anime does not matter in this case.

There are 25% of the anime which belong to either special or TV type and have a title length fewer than or equal to 16.5. Although they receive a good number of votes above 526, their average predicted rating is only 8.35.

There are only 3.6% of the anime which belong to either movie or OAV type and have fewer than 2001 votes. These do not have a good average predicted rating either, at 8.33.

Variables’ level of importance in predicting the anime rating with vip() function

vip(anime_top_250_reg_tree_fit)

We can see that when it comes to predicting the rating of anime, the number of votes is the most important variable, followed by the length of title, and then type.

Conclusion

To the extend of the top 250 anime:

Method drawbacks:

The chosen data set is quite limited regarding the number of variables, so we probably have left out a lot of other independent variables that could predict the rating of an anime better than vote, title length, and type.

For an observational study like this study of anime, we cannot demonstrate causality to the entire population of anime outside the data in the given data set

As this method only creates a single regression decision tree, it is prone to change to even small changes in the data set, and the level of accuracy may not be as high as when we apply other methods. Though, we think that given the number of variables and entries of this data set, it is also hard to apply other methods like classification decision tree and random forest and expect a higher level of accuracy.

Reference

Arce, E. (2021, January 16). The meaning of OAV in the context of anime. Linguablog. Retrieved December 11, 2021, from https://linguaholic.com/linguablog/japanese-OAV/.

Erens-Basement (n.d.). It honestly gets even worse with the longer LN adopted names. Cautious Hero’s Japanese title is “Shinchou Yuusha: Kono Yuusha [Comment on the online forum post I cant remember japanese names]. Reddit. https://www.reddit.com/r/anime/comments/dulci9/i_cant_remember_japanese_names/

Anime top 250 best rated (bayesian estimate). Anime News Network. (n.d.). Retrieved December 11, 2021, from https://www.animenewsnetwork.com/encyclopedia/ratings-anime.php?top50=best_bayesian&n=250.