Anime Data Project

Author

Mardan Mirzaguliyev

Published

October 18, 2023

INTRODUCTION

When I worked at a book store, young people, especially teenagers, used to come in and ask about animes. Some of them had watched the TV episodes and now wanted to try the book versions. Personally, I was not a fan of animes, but as time went by my attitude towards them started to shift. So, when I found this data set on Kaggle, I decided to pick it up and try to better understand anime as an artistic genre and, if possible, get answers to some general questions, like what attracts teenagers to this genre, and some specific ones, like what the most popular topics and sub-genres (crime, horror, sci-fi and so on) are. Of course, when the data set was explored in depth, more questions arose, and I tried to write them down and answer them as well.

As mentioned above, the data set was downloaded from the data sets section of the Kaggle website. According to its “About” and “Context” sections, it contains “information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings”.

There are two CSV files in the data set:

  • anime.csv

  • rating.csv

The anime.csv file has the following variables:

  • anime_id - myanimelist.net’s unique id identifying an anime.

  • name - full name of anime.

  • genre - comma separated list of genres for this anime.

  • type - movie, TV, OVA, etc.

  • episodes - how many episodes in this show. (1 if movie).

  • rating - average rating out of 10 for this anime.

  • members - number of community members that are in this anime’s “group”.

The rating.csv file has three variables:

  • user_id - non identifiable randomly generated user id.

  • anime_id - the anime that this user has rated.

  • rating - rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).

Some of the terms related to the context:

  • OVA - original video animation

  • ONA - original net animation

  • Special - (aka TV Special) is not weekly; it is usually yearly or a one-shot. It has only one episode, but a longer running time (e.g. 2 hours). It is still intended for broadcast and needs to meet the broadcast code.

ASK - QUESTIONS TO ANSWER

These were the questions considered the most important to answer and get insights about:

  1. What are the most popular genres for animes (number of animes represented)?
  2. What are the most popular types for animes (number of animes represented)?
  3. What are the maximum and minimum ratings that the specific genres received and which genres have the highest user rating?
  4. Which type of anime got the highest ratings?
  5. What is the maximum number of episodes and which anime has it?
  6. Is there any relationship between the number of episodes and rating?
  7. Is there any relationship between the number of members and rating?

PREPARE - LOADING THE DATA SET TO R, EDA

To load anime.csv into R, the read_csv function from the readr package, which is part of the tidyverse collection of packages, was used; rating.csv was loaded with the base R function read.csv. Also, the psych, ggthemes and e1071 packages were loaded to carry out auxiliary work on the data frames:

# Load packages
library(tidyverse)
library(psych)
library(ggthemes)
library(e1071)
library(pacman)
# Load packages needed for wordcloud creation
pacman::p_load("tm",
               "SnowballC",
               "wordcloud",
               "RColorBrewer",                 
               "RCurl",
               "XML")
# Load data sets into R
anime <- read_csv(file = "anime.csv", 
                  na = c("", "Unknown"))
Rows: 12294 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, genre, type
dbl (4): anime_id, episodes, rating, members

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rating <- read.csv(file = "rating.csv")

EDA

After loading data sets, different R functions were applied to the whole data frames and/or specific variables to better understand the information they contain.

Exploration of Anime data set

# View first few lines to know what kind of information a data frame contains
head(anime)
# A tibble: 6 × 7
  anime_id name                              genre type  episodes rating members
     <dbl> <chr>                             <chr> <chr>    <dbl>  <dbl>   <dbl>
1    32281 Kimi no Na wa.                    Dram… Movie        1   9.37  200630
2     5114 Fullmetal Alchemist: Brotherhood  Acti… TV          64   9.26  793665
3    28977 Gintama°                          Acti… TV          51   9.25  114262
4     9253 Steins;Gate                       Sci-… TV          24   9.17  673572
5     9969 Gintama&#039;                     Acti… TV          51   9.16  151266
6    32935 Haikyuu!!: Karasuno Koukou VS Sh… Come… TV          10   9.15   93351
# Get summary statistics of the data frame
summary(anime)
    anime_id         name              genre               type          
 Min.   :    1   Length:12294       Length:12294       Length:12294      
 1st Qu.: 3484   Class :character   Class :character   Class :character  
 Median :10260   Mode  :character   Mode  :character   Mode  :character  
 Mean   :14058                                                           
 3rd Qu.:24794                                                           
 Max.   :34527                                                           
                                                                         
    episodes           rating          members       
 Min.   :   1.00   Min.   : 1.670   Min.   :      5  
 1st Qu.:   1.00   1st Qu.: 5.880   1st Qu.:    225  
 Median :   2.00   Median : 6.570   Median :   1550  
 Mean   :  12.38   Mean   : 6.474   Mean   :  18071  
 3rd Qu.:  12.00   3rd Qu.: 7.180   3rd Qu.:   9437  
 Max.   :1818.00   Max.   :10.000   Max.   :1013917  
 NA's   :340       NA's   :230                       
# Check the data types of each column
str(anime)
spc_tbl_ [12,294 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ anime_id: num [1:12294] 32281 5114 28977 9253 9969 ...
 $ name    : chr [1:12294] "Kimi no Na wa." "Fullmetal Alchemist: Brotherhood" "Gintama°" "Steins;Gate" ...
 $ genre   : chr [1:12294] "Drama, Romance, School, Supernatural" "Action, Adventure, Drama, Fantasy, Magic, Military, Shounen" "Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen" "Sci-Fi, Thriller" ...
 $ type    : chr [1:12294] "Movie" "TV" "TV" "TV" ...
 $ episodes: num [1:12294] 1 64 51 24 51 10 148 110 1 13 ...
 $ rating  : num [1:12294] 9.37 9.26 9.25 9.17 9.16 9.15 9.13 9.11 9.1 9.11 ...
 $ members : num [1:12294] 200630 793665 114262 673572 151266 ...
 - attr(*, "spec")=
  .. cols(
  ..   anime_id = col_double(),
  ..   name = col_character(),
  ..   genre = col_character(),
  ..   type = col_character(),
  ..   episodes = col_double(),
  ..   rating = col_double(),
  ..   members = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
# number of columns and rows
ncol(anime)
[1] 7
nrow(anime)
[1] 12294
# Names of columns
colnames(anime)
[1] "anime_id" "name"     "genre"    "type"     "episodes" "rating"   "members" 

Genre

In the anime data set one of the most interesting variables is the genre column. It is quite a complicated column because it is generally difficult to attach an anime to a single genre. Instead, one anime usually has multiple genres, for instance ‘Action, Adventure, Comedy’ or ‘Romance, Fantasy, Ecchi, Sci-Fi’. So, the unique list of these combinations was saved in a vector:

anime_genres <- unique(anime$genre)

To get the individual genre names that form these combinations, the data in the genre column is first split into pieces at the commas separating them. Then the unique pieces are combined into a single vector:

# Split the genre column into a list of genres for each row
genre_lists <- strsplit(anime$genre, ", ")

# Combine all unique genres into one vector
unique_genres <- unique(unlist(genre_lists))

# Print the unique genres
unique_genres
 [1] "Drama"         "Romance"       "School"        "Supernatural" 
 [5] "Action"        "Adventure"     "Fantasy"       "Magic"        
 [9] "Military"      "Shounen"       "Comedy"        "Historical"   
[13] "Parody"        "Samurai"       "Sci-Fi"        "Thriller"     
[17] "Sports"        "Super Power"   "Space"         "Slice of Life"
[21] "Mecha"         "Music"         "Mystery"       "Seinen"       
[25] "Martial Arts"  "Vampire"       "Shoujo"        "Horror"       
[29] "Police"        "Psychological" "Demons"        "Ecchi"        
[33] "Josei"         "Shounen Ai"    "Game"          "Dementia"     
[37] "Harem"         "Cars"          "Kids"          "Shoujo Ai"    
[41] NA              "Hentai"        "Yaoi"          "Yuri"         
# Get the length of the new vector which is the count of the real unique genres
length(unique_genres)
[1] 44

So, the vector has 44 elements. One of them is NA, which stands for animes without a genre, so there are 43 distinct genre names. But as shown in the data set, it is difficult to attach an anime to a single genre. Hence, the genre column consists of combinations of these genres. Of course, there are also animes in the original data set that have just one single genre.

In order to see how often each single genre appears inside the combinations, a frequency table and a word cloud were considered useful:
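
The NA entry in the printed vector comes from animes whose genre is missing. A minimal sketch of how it can be kept out of the list of genre names is shown below; the toy vector is hypothetical and only stands in for anime$genre:

```r
# Toy genre column (hypothetical values); NA marks an anime without a genre
genre <- c("Action, Comedy", "Comedy", NA, "Sci-Fi, Thriller")

# Split every combination on ", " and flatten the result into one vector
all_genres <- unlist(strsplit(genre, ", "))

# Keep NA out of the list of genre names
genre_names <- unique(all_genres[!is.na(all_genres)])

genre_names
length(genre_names)
```

Counting only the non-missing entries gives the number of actual genre names rather than the number of distinct values in the column.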

script <- "wordcloud.R"
source(script)

word_cloud <- rquery.wordcloud(anime_genres, 
                      type ="text", 
                      lang = "english",
                      textStemming=FALSE,
                      min.freq=1, 
                      max.words=2000)

freq_table <- word_cloud$freqTable
freq_table
                       word freq
comedy               comedy 1288
action               action 1194
drama                 drama  848
adventure         adventure  835
scifi                 scifi  832
fantasy             fantasy  814
romance             romance  768
shounen             shounen  707
supernatural   supernatural  603
school               school  483
magic                 magic  377
mecha                 mecha  365
historical       historical  339
ecchi                 ecchi  335
shoujo               shoujo  326
mystery             mystery  295
life                   life  293
slice                 slice  293
kids                   kids  277
seinen               seinen  262
power                 power  249
super                 super  249
horror               horror  233
military           military  216
demons               demons  188
music                 music  177
harem                 harem  177
space                 space  174
arts                   arts  171
martial             martial  171
psychological psychological  162
parody               parody  161
sports               sports  157
hentai               hentai  136
police               police  104
samurai             samurai   95
game                   game   95
vampire             vampire   74
thriller           thriller   66
dementia           dementia   53
josei                 josei   34
cars                   cars   34
yaoi                   yaoi   21
yuri                   yuri   18

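So, the combinations mostly contain genres and topics such as comedy, action, drama, adventure and so on. Note that the word cloud works on single words, so multi-word genres appear split (“slice” and “life”, “super” and “power”, “martial” and “arts”), and that the counts refer to the unique combinations stored in anime_genres, not to individual animes.

Another way to understand this column is to group the animes by the genre column and count them: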
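
If the goal is to count how many animes carry each individual genre, one possible alternative to the word cloud script is to split the genre strings directly. The sketch below uses base R only and a hypothetical toy data frame (anime_toy); it is an illustration under those assumptions, not the method used in this analysis:

```r
# Toy data (hypothetical): one row per anime, genres as a comma-separated string
anime_toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  genre = c("Comedy, Slice of Life", "Action, Sci-Fi", "Comedy", NA),
  stringsAsFactors = FALSE
)

# Split each genre string on ", " so that multi-word genres stay intact
genre_list <- strsplit(anime_toy$genre, ", ")

# Count how many animes carry each genre, most frequent first
# (table() leaves out NA by default)
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)

genre_counts
```

Because the split happens on the separator and not on whitespace, “Slice of Life” and “Sci-Fi” are counted as single genres.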

genres_group <- anime |> 
  group_by(genre) |> 
  summarise(anime_count = n()) |> 
  arrange(desc(anime_count))

genres_group
# A tibble: 3,265 × 2
   genre                 anime_count
   <chr>                       <int>
 1 Hentai                        823
 2 Comedy                        523
 3 Music                         301
 4 Kids                          199
 5 Comedy, Slice of Life         179
 6 Dementia                      137
 7 Fantasy, Kids                 128
 8 Fantasy                       114
 9 Comedy, Kids                  112
10 Drama                         107
# ℹ 3,255 more rows
# Summary statistic of the number of animes related to the genres
describe(genres_group$anime_count)
   vars    n mean    sd median trimmed mad min max range  skew kurtosis   se
X1    1 3265 3.77 20.03      1    1.67   0   1 823   822 28.38  1015.21 0.35

The grouped data was then filtered to keep the genre values that are attached to 50 or more animes:

# Create a filtered version of grouped data
genres_group_50 <- genres_group |> 
  filter(anime_count >= 50)
# Create a bar chart to visualize genre and their counts based on grouped data
ggplot(data = genres_group_50,
                    aes(x = as_factor(genre), 
                        y = anime_count)) +
  geom_bar(stat = "identity", 
           fill = "blue") +
  coord_flip() +
  labs(x = "Genre",
       y = "Number of animes",
       title = "Anime Genres and their representation",
       fill = "Count") + 
  theme_clean(base_size = 12, 
              base_family = "sans")

Both the grouped data and the visualization show that there are 62 animes without any genre mentioned. This problem needs to be addressed in the data processing phase of the analysis.

# Presentation of the number of animes
ggplot(genres_group_50, aes(x = anime_count)) + 
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 4) +
  theme_clean(base_size = 12, 
              base_family = "sans") +
  labs(title = "Distribution of animes by genre",
       x = "Number of animes")

At this point of the analysis there is already an answer to the first question. Even though Comedy is the genre mentioned most often in combinations, Hentai as a stand-alone genre value has the most animes attached to it in the data set, followed by Comedy and Music. So, this is the answer to the first question asked:

  1. What are the most popular genres for animes (number of animes represented)?

Type

Another variable that contains valuable information about the animes is the `type` variable. It represents the type of media in which an anime was released, like TV series, movie, OVA, ONA etc.:

unique(anime$type)
[1] "Movie"   "TV"      "OVA"     "Special" "Music"   "ONA"     NA       
# Grouping based on types
types_group <- anime |> 
  group_by(type) |> 
  summarise(anime_count = n()) |>  
  arrange(desc(anime_count))

types_group
# A tibble: 7 × 2
  type    anime_count
  <chr>         <int>
1 TV             3787
2 OVA            3311
3 Movie          2348
4 Special        1676
5 ONA             659
6 Music           488
7 <NA>             25

Grouping reveals that TV is the most common format. OVA and Movie are the second and the third most widely used types. There are also 25 animes without any type value:

anime |> 
  filter(is.na(type))
# A tibble: 25 × 7
   anime_id name                             genre type  episodes rating members
      <dbl> <chr>                            <chr> <chr>    <dbl>  <dbl>   <dbl>
 1    30484 Steins;Gate 0                    Sci-… <NA>        NA     NA   60999
 2    34437 Code Geass: Fukkatsu no Lelouch  Acti… <NA>        NA     NA   22748
 3    33352 Violet Evergarden                Dram… <NA>        NA     NA   20564
 4    33248 K: Seven Stories                 Acti… <NA>        NA     NA   22133
 5    33845 Free! (Shinsaku)                 Scho… <NA>        NA     NA    8666
 6    33475 Busou Shoujo Machiavellianism    Acti… <NA>        NA     NA    1896
 7    31456 Code:Realize: Sousei no Himegimi Adve… <NA>        NA     NA    4017
 8    34332 Flying Babies                    <NA>  <NA>        NA     NA      22
 9    34280 Gamers!                          Come… <NA>        NA     NA    1045
10    34485 Ganko-chan                       <NA>  <NA>        NA     NA      11
# ℹ 15 more rows

Type, the number of episodes and rating information are missing for these 25 animes. As they are a very small portion of the data, filtering them out was considered the best way to handle them:

types_group <- anime |> 
  filter(!is.na(type)) |> 
  group_by(type) |> 
  summarise(anime_count = n()) |>  
  arrange(desc(anime_count))

types_group
# A tibble: 6 × 2
  type    anime_count
  <chr>         <int>
1 TV             3787
2 OVA            3311
3 Movie          2348
4 Special        1676
5 ONA             659
6 Music           488
# Create a bar chart to visualize types and their counts based on grouped data
ggplot(data = types_group, 
                   aes(x = as_factor(type),
                       y = anime_count)) + 
  geom_bar(stat = "identity", 
                    fill ="blue") +
  labs(x = "Type",
       y = "Number of animes",
       title = "Anime types and their representation") + 
  theme_clean(base_size = 12, 
              base_family = "sans")

# Presentation of the number of animes
ggplot(types_group, aes(x = anime_count)) +
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 4) +
  theme_clean(base_size = 12,
              base_family = "sans") +
  labs(title = "Distribution of animes by types",
       x = "Number of animes")

The number of animes attached to each type seems to be distributed much more evenly than the number attached to each genre combination.

Episodes

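In the raw CSV file the episodes column contains the string “Unknown” for some animes, so with a default import it is read as character, which makes it less useful for analysis. To conduct mathematical operations on this variable it has to be of type double. Because “Unknown” was declared as a missing-value marker in the import step (the na argument of read_csv), the column should already be numeric, with the unknown counts stored as NA. It is still worth checking its type and the number of NA values: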

# Check data type of episodes
typeof(anime$episodes)
[1] "double"
# Check numbers of NAs
sum(is.na(anime$episodes))
[1] 340

So, the column is of type double and contains 340 NA values. Now, it is time to view the unique values of the variable:

unique(anime$episodes)
  [1]    1   64   51   24   10  148  110   13  201   25   22   75    4   26   12
 [16]   27   43   74   37    2   11   99   NA   39  101   47   50   62   33  112
 [31]   23    3   94    6    8   14    7   40   15  203   77  291  120  102   96
 [46]   38   79  175  103   70  153   45    5   21   63   52   28  145   36   69
 [61]   60  178  114   35   61   34  109   20    9   49  366   97   48   78  358
 [76]  155  104  113   54  167  161   42  142   31  373  220   46  195   17 1787
 [91]   73  147  127   16   19   98  150   76   53  124   29  115  224   44   58
[106]   93  154   92   67  172   86   30  276   59   72  330   41  105  128  137
[121]   56   55   65  243  193   18  191  180   91  192   66  182   32  164  100
[136]  296  694   95   68  117  151  130   87  170  119   84  108  156  140  331
[151]  305  300  510  200   88 1471  526  143  726  136 1818  237 1428  365  163
[166]  283   71  260  199  225  312  240 1306 1565  773 1274   90  475  263   83
[181]   85 1006   80  162  132  141  125

The string “Unknown” no longer appears among the values; it has been replaced by NA. For the same reason, comparing the column with “Unknown” no longer produces a count and returns NA instead:

total_unknown <- sum(anime$episodes == "Unknown")
total_unknown
[1] NA
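
The NA result follows from how R treats missing values in comparisons. A small sketch on a hypothetical toy vector (not the real episodes column) shows the behaviour and a way to count the missing entries directly:

```r
# Toy numeric column with missing values, as produced by importing "Unknown" as NA
episodes <- c(12, NA, 1, NA, 24)

# Any comparison involving NA is itself NA, so the sum is NA as well
sum(episodes == "Unknown")

# Counting the missing values directly gives the intended number
n_missing <- sum(is.na(episodes))
n_missing
```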

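In the first pass of the analysis the column was still character, so the data frame was removed and imported again with “Unknown” values treated as NAs. This was done in the loading script above. After the import the type conversion was applied; on a column that is already double it changes nothing: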

# Convert episodes to the number type
anime$episodes <- as.double(anime$episodes)
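
The effect of declaring “Unknown” as a missing-value marker at import can be illustrated on a tiny in-memory CSV. The sketch below is an assumption-laden stand-in: it uses base R's read.csv with na.strings (the base counterpart of the na argument of read_csv) and hypothetical rows instead of the real file:

```r
# Toy CSV text (hypothetical rows) where a missing episode count is written as "Unknown"
csv_text <- "name,episodes\nAnime A,12\nAnime B,Unknown\nAnime C,1"

# Default import: the "Unknown" string forces the whole column to character
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$episodes)

# Declaring "Unknown" as a missing-value marker lets the column be parsed as numeric
clean <- read.csv(text = csv_text,
                  na.strings = c("", "Unknown"),
                  stringsAsFactors = FALSE)
class(clean$episodes)
sum(is.na(clean$episodes))
```

Handling the marker at import keeps the conversion in one place and avoids the coercion warning that as.double() would raise on the “Unknown” strings.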

So, the 340 “Unknown” values are now NAs, which is R’s way to denote missing values. With the column in numeric form, statistical functions can be applied to it:

# Minimum number of episodes
min(anime$episodes, na.rm = TRUE)
[1] 1
# Maximum number of episodes
max(anime$episodes, na.rm = TRUE)
[1] 1818
# Descriptive statistics
describe(anime$episodes)
   vars     n  mean    sd median trimmed  mad min  max range  skew kurtosis
X1    1 11954 12.38 46.87      2    5.96 1.48   1 1818  1817 23.38   732.85
     se
X1 0.43
# Grouping based on episodes, name, genre, type
episodes_group <- anime |> 
  group_by(episodes, name, genre, type) |>  
  summarise(
    anime_count = n(),
    .groups = "keep") |> 
  arrange(desc(episodes))

episodes_group
# A tibble: 12,294 × 5
# Groups:   episodes, name, genre, type [12,294]
   episodes name                                  genre        type  anime_count
      <dbl> <chr>                                 <chr>        <chr>       <int>
 1     1818 Oyako Club                            Comedy, Sli… TV              1
 2     1787 Doraemon (1979)                       Adventure, … TV              1
 3     1565 Kirin Monoshiri Yakata                Kids         TV              1
 4     1471 Manga Nippon Mukashibanashi (1976)    Fantasy, Hi… TV              1
 5     1428 Hoka Hoka Kazoku                      Comedy       TV              1
 6     1306 Kirin Ashita no Calendar              Historical,… TV              1
 7     1274 Monoshiri Daigaku: Ashita no Calendar Historical   TV              1
 8     1006 Sekai Monoshiri Ryoko                 Comedy       TV              1
 9      773 Kotowaza House                        Comedy, Sli… TV              1
10      726 Shima Shima Tora no Shimajirou        Adventure, … TV              1
# ℹ 12,284 more rows
# filter animes with episodes count greater than 100
episodes_group |> 
  filter(episodes > 100) |>
  # create visualization based on this filter to see the distribution of episodes count
ggplot(aes(x = episodes)) +
  geom_histogram(binwidth = 50,
                 fill = "blue") +
  labs(
    title = "Distribution of episodes",
       x = "Number of episodes",
       y = "Frequency"
    ) +
  theme_clean(base_size = 12,
              base_family = "sans")

Only a few animes have 1000 or more episodes, while at least half of the animes in this data set have 1 or 2 episodes (the median is 2). The episode count of series ranges between 2 and 1818. So, the maximum number of episodes is 1818, which belongs to the anime “Oyako Club” in the genre combination “Comedy, Slice of Life”. This is the answer to our fifth question:

  5. What is the maximum number of episodes and which anime has it?

Rating

This column is the collection of the average ratings animes got from users. The individual ratings from the users are collected in the rating.csv file, which is joined to the anime data frame in a later part of the analysis. But it is still an interesting column, so its observations were explored in this part too:

describe(anime$rating)
   vars     n mean   sd median trimmed  mad  min max range  skew kurtosis   se
X1    1 12064 6.47 1.03   6.57    6.53 0.96 1.67  10  8.33 -0.54     0.51 0.01
# Presentation of ratings
ggplot(anime, aes(x = rating)) +
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 4) +
  theme_clean(base_size = 12,
              base_family = "sans") +
  labs(title = "Average ratings of animes",
       x = "Average rating")

Most animes have ratings between about 6 and 7 (the median is 6.57). There are outliers at both the low and the high end of this variable. The distribution of the rating is left-skewed:

skewness(anime$rating, na.rm = TRUE)
[1] -0.5434349
ggplot(anime, aes(x = rating)) +
  geom_histogram(binwidth = 0.5,
                 fill = "blue",
                 color =  "black") +
  labs(title = "Distribution of rating",
       x = "Rating", 
       y = "Frequency") +
  theme_clean(base_size = 12,
              base_family = "sans")
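
The negative skewness value reported above can be related to its moment-based definition: the third central moment divided by the cubed standard deviation. The helper below (sample_skewness, a hypothetical name) is a sketch of that definition; to my understanding it corresponds to the default type = 3 of e1071::skewness, but that correspondence is an assumption worth checking against the package documentation:

```r
# Moment-based sample skewness: third central moment over the cubed standard deviation
sample_skewness <- function(x) {
  x <- x[!is.na(x)]              # drop missing values, like na.rm = TRUE
  m3 <- mean((x - mean(x))^3)    # third central moment
  m3 / sd(x)^3                   # sd() uses the n - 1 denominator
}

# Toy ratings (hypothetical) with a long tail towards low values
ratings <- c(1, 6, 7, 7, 8)
sample_skewness(ratings)
```

A negative result means the tail on the low side is longer, which is what “left-skewed” refers to.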

Members

This variable is about the online groups of animes. It represents how many members the group dedicated to a single anime has:

describe(anime$members)
   vars     n     mean       sd median trimmed     mad min     max   range skew
X1    1 12294 18071.34 54820.68   1550 5589.94 2172.01   5 1013917 1013912 6.68
   kurtosis     se
X1    62.82 494.42

So, this variable is also full of outliers: the smallest group has 5 members and the largest has 1,013,917 (a range of 1,013,912). A better way to work with this variable was thought to be to filter out the smaller member groups.

ggplot(anime, 
       aes(x = members)) +
  geom_boxplot(outlier.colour = "red", 
                           outlier.shape = 4) +
  theme_clean(base_size = 12, 
              base_family = "sans") +
  labs(title = "Member groups and the member count of them",
       x = "Members")
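
The points drawn as outliers in a box plot like this one are, by default, those lying more than 1.5 times the interquartile range beyond the quartiles. Using the quartiles of members reported by summary(anime) earlier, the fences can be worked out by hand. This is only an approximation, since the plotting function computes its own quartiles and they may differ slightly:

```r
# Quartiles of members as reported by summary(anime) above
q1 <- 225
q3 <- 9437

# Interquartile range and the 1.5 * IQR fences of the default box plot rule
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

The lower fence is negative, so no group can be a low outlier, while every group with more than roughly 23,255 members is marked as a high outlier.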

Exploration of rating data set

The rating data frame has only three columns. The most important ones are anime_id, which is used to join it with the anime data frame, and rating, which is the actual rating each user gave to a specific anime.

# View first few lines to know what kind of information a data frame contains
head(rating)
  user_id anime_id rating
1       1       20     -1
2       1       24     -1
3       1       79     -1
4       1      226     -1
5       1      241     -1
6       1      355     -1
# Get summary statistics of the dataset
summary(rating)
    user_id         anime_id         rating      
 Min.   :    1   Min.   :    1   Min.   :-1.000  
 1st Qu.:18974   1st Qu.: 1240   1st Qu.: 6.000  
 Median :36791   Median : 6213   Median : 7.000  
 Mean   :36728   Mean   : 8909   Mean   : 6.144  
 3rd Qu.:54757   3rd Qu.:14093   3rd Qu.: 9.000  
 Max.   :73516   Max.   :34519   Max.   :10.000  

As noted in the Introduction, a rating of -1 means that the user watched the anime but did not rate it. It caused confusion in the analysis, so it was converted to a more “useful” value in the data processing section.

# Presentation of ratings
# Confusing visualization: -1 is also displayed as a valid value
ggplot(rating, aes(x = rating)) +
  geom_boxplot(outlier.colour = "red", 
               outlier.shape = 4) +
  theme_clean(base_size = 12,
              base_family = "sans") +
  labs(title = "Ratings of animes", 
       subtitle = "Ratings are integers from 1 to 10", 
       x = "Rating")

# Check the data types of each column
str(rating)
'data.frame':   7813737 obs. of  3 variables:
 $ user_id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ anime_id: int  20 24 79 226 241 355 356 442 487 846 ...
 $ rating  : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
# number of columns and rows
ncol(rating)
[1] 3
nrow(rating)
[1] 7813737
# Names of columns
colnames(rating)
[1] "user_id"  "anime_id" "rating"  

Joined data frame

After exploring the two data sets separately, they were joined on the anime_id column:

animes_joined <- inner_join(anime, 
                            rating, 
                            by = "anime_id")
head(animes_joined)
# A tibble: 6 × 9
  anime_id name           genre type  episodes rating.x members user_id rating.y
     <dbl> <chr>          <chr> <chr>    <dbl>    <dbl>   <dbl>   <int>    <int>
1    32281 Kimi no Na wa. Dram… Movie        1     9.37  200630      99        5
2    32281 Kimi no Na wa. Dram… Movie        1     9.37  200630     152       10
3    32281 Kimi no Na wa. Dram… Movie        1     9.37  200630     244       10
4    32281 Kimi no Na wa. Dram… Movie        1     9.37  200630     271       10
5    32281 Kimi no Na wa. Dram… Movie        1     9.37  200630     278       -1
6    32281 Kimi no Na wa. Dram… Movie        1     9.37  200630     322       10
str(animes_joined)
spc_tbl_ [7,813,727 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ anime_id: num [1:7813727] 32281 32281 32281 32281 32281 ...
 $ name    : chr [1:7813727] "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." ...
 $ genre   : chr [1:7813727] "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" ...
 $ type    : chr [1:7813727] "Movie" "Movie" "Movie" "Movie" ...
 $ episodes: num [1:7813727] 1 1 1 1 1 1 1 1 1 1 ...
 $ rating.x: num [1:7813727] 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 ...
 $ members : num [1:7813727] 200630 200630 200630 200630 200630 ...
 $ user_id : int [1:7813727] 99 152 244 271 278 322 398 462 490 548 ...
 $ rating.y: int [1:7813727] 5 10 10 10 -1 10 10 8 10 10 ...
 - attr(*, "spec")=
  .. cols(
  ..   anime_id = col_double(),
  ..   name = col_character(),
  ..   genre = col_character(),
  ..   type = col_character(),
  ..   episodes = col_double(),
  ..   rating = col_double(),
  ..   members = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
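
The joined data frame has 7,813,727 rows, while rating has 7,813,737, so the inner join appears to have dropped 10 rating rows whose anime_id has no match in the anime data frame (assuming anime_id is unique there). A sketch of how such rows can be found, using base R and hypothetical toy tables; with dplyr the same check could be written as anti_join(rating, anime, by = "anime_id"):

```r
# Toy tables (hypothetical ids): anime id 99 is rated but missing from the anime table
anime_toy  <- data.frame(anime_id = c(1, 2, 3),
                         name     = c("A", "B", "C"))
rating_toy <- data.frame(user_id  = c(10, 10, 11, 12),
                         anime_id = c(1, 2, 99, 3),
                         rating   = c(8, -1, 7, 10))

# Inner join in base R: only ids present in both tables survive
joined_toy <- merge(anime_toy, rating_toy, by = "anime_id")

# Rows of the rating table that have no partner in the anime table
unmatched <- rating_toy[!rating_toy$anime_id %in% anime_toy$anime_id, ]

# Number of rating rows lost in the join
nrow(rating_toy) - nrow(joined_toy)
unmatched
```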

PROCESS - DATA CLEANING

Before this phase some basic data cleaning had already been applied to the anime data frame separately:

  • Data type of episodes variable has been converted from character to double.

The rating data frame, however, was joined as it was, so its rating variable carried its data type and values over unchanged, and all the -1 values are present in the rating.y column of the joined data frame. Such placeholder values, like NAs, can be treated differently depending on the type, purpose and reliability of the data and other factors. In this specific analysis they were replaced with the average value of the rating.

But before that, one more problem with the joined data frame was obvious: the column names rating.x and rating.y were not user friendly, so they were replaced with descriptive ones:

# Rename rating.x, rating.y variables
animes_joined <- animes_joined |> 
  rename(
    average_rating = rating.x, 
         user_rating = rating.y
    )
# Check data frame after renaming
str(animes_joined)
spc_tbl_ [7,813,727 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ anime_id      : num [1:7813727] 32281 32281 32281 32281 32281 ...
 $ name          : chr [1:7813727] "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." ...
 $ genre         : chr [1:7813727] "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" ...
 $ type          : chr [1:7813727] "Movie" "Movie" "Movie" "Movie" ...
 $ episodes      : num [1:7813727] 1 1 1 1 1 1 1 1 1 1 ...
 $ average_rating: num [1:7813727] 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 ...
 $ members       : num [1:7813727] 200630 200630 200630 200630 200630 ...
 $ user_id       : int [1:7813727] 99 152 244 271 278 322 398 462 490 548 ...
 $ user_rating   : int [1:7813727] 5 10 10 10 -1 10 10 8 10 10 ...
 - attr(*, "spec")=
  .. cols(
  ..   anime_id = col_double(),
  ..   name = col_character(),
  ..   genre = col_character(),
  ..   type = col_character(),
  ..   episodes = col_double(),
  ..   rating = col_double(),
  ..   members = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

In order to proceed to replace -1 values with averages, unique values in the user_rating variable were displayed. Then, to check the effectiveness of the replacement, total number of -1 ratings has been identified to compare the before and after results:

# Distinct values in user_rating column
unique(animes_joined$user_rating)
 [1]  5 10 -1  8  9  7  6  2  4  1  3
# Total number of -1 values
sum(animes_joined$user_rating == -1)
[1] 1476488
# Replacement of -1 with mean value
animes_joined$user_rating[animes_joined$user_rating == -1] <- round(
  mean(animes_joined$user_rating,
       na.rm = TRUE),
  2
  )
# Distinct values in user_rating column after the replacement
unique(animes_joined$user_rating)
 [1]  5.00 10.00  6.14  8.00  9.00  7.00  6.00  2.00  4.00  1.00  3.00
# Total number of NAs after the replacement
sum(is.na(animes_joined$user_rating))
[1] 0
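
One detail of the replacement above is worth noting: mean() is computed over the whole column, including the -1 placeholders, which appears to be why the fill value 6.14 matches the overall mean of 6.144 reported by summary(rating). A possible alternative, sketched here on a hypothetical toy vector, is to mark the placeholders as NA first and average only the real ratings:

```r
# Toy user ratings (hypothetical); -1 means "watched but not rated"
user_rating <- c(5, 10, -1, 8, -1)

# Mean over the whole column: the -1 placeholders pull it down
mean_with_placeholder <- mean(user_rating)

# Mark the placeholders as missing, then average only the real ratings
user_rating[user_rating == -1] <- NA
mean_real <- mean(user_rating, na.rm = TRUE)

# Fill the missing ratings with the mean of the real ones
user_rating[is.na(user_rating)] <- round(mean_real, 2)

c(mean_with_placeholder = mean_with_placeholder, mean_real = mean_real)
user_rating
```

Whether to impute at all, or to leave the unrated entries as NA and rely on na.rm = TRUE in later summaries, is a design choice that depends on the question being asked.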

After this replacement the box plot was recreated to check whether there are still other anomalies in this column:

ggplot(animes_joined, 
       aes(x = user_rating)) +
geom_boxplot(outlier.colour = "red", 
             outlier.shape = 4) +
  theme_clean(base_size = 12,
              base_family = "sans") +
  labs(title = "Ratings of animes",
       subtitle = "Ratings are integers from 1 to 10",
       x = "Rating")

ANALYSIS

Ratings of the genres

To get a summary of the ratings by genre, the joined data frame was grouped by the genre column:

genres_summary <- animes_joined |> 
  group_by(genre) |> 
  summarise(
    avg_rating = round(
      mean(user_rating,
      na.rm = TRUE),
      2
      ),
    min_rating = min(user_rating,
                     na.rm = TRUE),
    max_rating = max(user_rating,
                     na.rm = TRUE)
  ) |> 
  arrange(desc(avg_rating))

genres_summary
# A tibble: 3,155 × 4
   genre                                        avg_rating min_rating max_rating
   <chr>                                             <dbl>      <dbl>      <dbl>
 1 Action, Historical, Kids                          10            10         10
 2 Action, Adventure, Drama, Fantasy, Magic, M…       8.92          1         10
 3 Drama, Fantasy, Romance, Slice of Life, Sup…       8.77          1         10
 4 Action, Drama, Mecha, Military, Sci-Fi, Sup…       8.68          1         10
 5 Action, Comedy, Historical, Parody, Samurai…       8.66          1         10
 6 Drama, Music, Romance, School, Shounen             8.63          1         10
 7 Sci-Fi, Thriller                                   8.62          1         10
 8 Drama, Horror, Mystery, Police, Psychologic…       8.61          1         10
 9 Drama, Romance, School, Supernatural               8.61          1         10
10 Action, Mecha, Military, School, Sci-Fi, Su…       8.57          1         10
# ℹ 3,145 more rows

The Action, Historical, Kids genre shows an interesting result: its minimum and maximum ratings are both 10 out of 10. The data was filtered to show the animes in this genre:

animes_joined |> 
  filter(genre == "Action, Historical, Kids")
# A tibble: 1 × 9
  anime_id name  genre type  episodes average_rating members user_id user_rating
     <dbl> <chr> <chr> <chr>    <dbl>          <dbl>   <dbl>   <int>       <dbl>
1    33484 Shir… Acti… Movie        1           4.71      45   69497          10

Only one anime is represented in this genre, and it got a 10 from the only user who rated it. A single rating is not a reliable average, so this genre was filtered out of the summary:

genres_summary <- animes_joined |> 
  filter(genre != "Action, Historical, Kids") |>
  group_by(genre) |> 
  summarise(
    avg_rating = round(
      mean(
        user_rating, 
        na.rm = TRUE), 
      2
      ),
    min_rating = min(user_rating,
                     na.rm = TRUE),
    max_rating = max(user_rating, 
                     na.rm = TRUE)
  ) |>
  arrange(genre)
#   arrange(desc(avg_rating))

genres_summary
# A tibble: 3,153 × 4
   genre                                        avg_rating min_rating max_rating
   <chr>                                             <dbl>      <dbl>      <dbl>
 1 Action                                             6.69          1         10
 2 Action, Adventure                                  7.01          1         10
 3 Action, Adventure, Cars, Comedy, Sci-Fi, Sh…       6.89          2         10
 4 Action, Adventure, Cars, Mecha, Sci-Fi, Sho…       5.98          1         10
 5 Action, Adventure, Cars, Sci-Fi                    6.75          1         10
 6 Action, Adventure, Comedy                          6.8           1         10
 7 Action, Adventure, Comedy, Demons, Drama, E…       7.04          3         10
 8 Action, Adventure, Comedy, Demons, Fantasy,…       7.19          2         10
 9 Action, Adventure, Comedy, Demons, Fantasy,…       7.79          1         10
10 Action, Adventure, Comedy, Demons, Fantasy,…       6.91          1         10
# ℹ 3,143 more rows
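
Rather than removing one genre by name, a more general option is to count the ratings behind each average and keep only groups above a minimum. The sketch below uses invented toy data, and the threshold `min_ratings` and the column `n_ratings` are hypothetical names, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for animes_joined; values are invented for illustration
toy <- tibble(
  genre = c("Action", "Action", "Action", "Kids", NA, "Drama", "Drama"),
  user_rating = c(8, 6, 7, 10, 9, 5, 8)
)

# Hypothetical minimum number of ratings a genre needs to be kept
min_ratings <- 2

toy_summary <- toy |>
  filter(!is.na(genre)) |>            # drop unknown genres explicitly
  group_by(genre) |>
  summarise(
    n_ratings = n(),                  # how many ratings back the average
    avg_rating = round(mean(user_rating), 2)
  ) |>
  filter(n_ratings >= min_ratings) |> # remove groups with too few ratings
  arrange(desc(avg_rating))

toy_summary
```

Note that `filter(genre != "Action, Historical, Kids")` also drops rows whose genre is NA, because the comparison evaluates to NA for them. That would be consistent with the summary shrinking from 3,155 to 3,153 rows rather than 3,154.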

So, even though the Hentai genre has more animes, action animes got the highest ratings, which answers the third question of the analysis:

  3. What are the maximum and minimum ratings that the specific genres received and which genres have the highest user rating?
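
Since the genre variable holds whole combinations such as "Action, Adventure", comparing individual genres requires splitting those strings first. The following is a minimal base R sketch on invented data, assuming the genres are always separated by a comma and a space; `genre_list`, `long` and `avg_by_genre` are illustrative names.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  genre = c("Action, Comedy", "Comedy", "Action, Drama"),
  user_rating = c(8, 6, 9),
  stringsAsFactors = FALSE
)

# Split each combination on ", " into its individual genres
genre_list <- strsplit(toy$genre, ", ", fixed = TRUE)

# One row per (individual genre, rating) pair
long <- data.frame(
  genre = unlist(genre_list),
  user_rating = rep(toy$user_rating, lengths(genre_list))
)

# Average rating per individual genre, highest first
avg_by_genre <- sort(
  tapply(long$user_rating, long$genre, mean),
  decreasing = TRUE
)

avg_by_genre
```

With this layout an anime that belongs to several genres contributes its ratings to each of them, which is one reasonable convention but not the only one.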

Ratings of the types

As animes are released in different media formats, the ratings were summarised and visualised by format:

types_summary <- 
  animes_joined |> 
  group_by(type) |> 
  summarise(
    avg_rating = round(
      mean(
        user_rating,
        na.rm = TRUE), 
      2
      ),
    min_rating = min(user_rating,
                     na.rm = TRUE),
    max_rating = max(user_rating,
                     na.rm = TRUE)
  ) |> 
  arrange(desc(avg_rating))

types_summary
# A tibble: 7 × 4
  type    avg_rating min_rating max_rating
  <chr>        <dbl>      <dbl>      <dbl>
1 <NA>          8.5           7          9
2 TV            7.59          1         10
3 Movie         7.57          1         10
4 Special       7.14          1         10
5 OVA           7.06          1         10
6 ONA           6.99          1         10
7 Music         6.95          1         10

Again, an anime of unknown type is present here, and it has the highest average rating. Filtering revealed that it is the same anime, Steins;Gate 0, which lacks information about type, episodes and average rating and received ratings between 7 and 9 from only 5 users. It was also filtered out:

types_summary <-
  animes_joined |> 
  # The unknown type is stored as NA, not as an empty string
  filter(!is.na(type)) |>
  group_by(type) |> 
  summarise(
    avg_rating = round(
    mean(user_rating,
        na.rm = TRUE),
    2
    ),
    min_rating = min(user_rating, 
                     na.rm = TRUE),
    max_rating = max(user_rating, 
                     na.rm = TRUE)
  ) |> 
  arrange(desc(avg_rating))

types_summary
# A tibble: 6 × 4
  type    avg_rating min_rating max_rating
  <chr>        <dbl>      <dbl>      <dbl>
1 TV            7.59          1         10
2 Movie         7.57          1         10
3 Special       7.14          1         10
4 OVA           7.06          1         10
5 ONA           6.99          1         10
6 Music         6.95          1         10
# Highest rated formats for animes
ggplot(data = types_summary, 
                   aes(x = as_factor(type),
                       y = avg_rating)) +
geom_bar(stat = "identity", 
                    fill ="blue") +
  labs(x = "Type",
       y = "Average rating",
       title = "Anime types and average rating") + 
  theme_clean(base_size = 12,
              base_family = "sans")

So, apart from small differences, all formats receive similar average user ratings, roughly between 7 and 7.6, with TV (7.59) and Movie (7.57) at the top. This answers the fourth question:

  4. Which type of anime got the highest ratings?

Relationship between the number of episodes and rating

# Bin episodes into three groups; include.lowest = TRUE keeps
# 1-episode titles in the first bin instead of turning them into NA
ggplot(animes_joined, mapping = aes(
  x = cut(episodes, 
          breaks = c(1, 500, 1000, 1900),
          labels = c("1-500", 
                     "500-1000", 
                     "1000-1900"),
          include.lowest = TRUE), 
  y = user_rating)
  ) + 
  geom_violin(fill = "red") +
  labs(title = "Episodes and rating", 
       x = "Number of episodes", 
       y = "Rating") +
  theme_clean(base_size = 12, 
              base_family = "sans")

Here there are three groups:

  1. Animes with an episode count between 1 and 500

  2. Episode count between 500 and 1000

  3. Episode count between 1000 and 1900 (1818 is the maximum)

In each group most of the animes received ratings above 5, so the number of episodes does not appear to be correlated with the user rating.
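
As a small sketch of how `cut()` assigns episode counts to bins: by default each interval is open on the left, so a value equal to the lowest break falls outside every bin and becomes NA unless `include.lowest = TRUE` is set. The episode counts below are invented for illustration.

```r
# Toy episode counts, including a 1-episode title and the maximum of 1818
episodes <- c(1, 12, 500, 501, 1000, 1818)

# Two breaks above the minimum give only two bins, and 1 becomes NA
default_bins <- cut(episodes, breaks = c(1, 1000, 2000))

# Four breaks give the three intended bins; include.lowest keeps the value 1
three_bins <- cut(
  episodes,
  breaks = c(1, 500, 1000, 1900),
  labels = c("1-500", "500-1000", "1000-1900"),
  include.lowest = TRUE
)

# Count how many titles land in each bin
table(default_bins, useNA = "ifany")
table(three_bins, useNA = "ifany")
```

Checking the bin counts with `table()` before plotting is a quick way to confirm that the axis labels describe the bins that actually exist.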

Relationship between the number of members and rating

ggplot(animes_joined, mapping =
         aes(x = cut(members, breaks = 
                  c(1, 
                    500000, 
                    1000000,
                    1014000)), 
        y = user_rating)) + 
  geom_violin(fill = "red") +
  scale_x_discrete(
    labels = c("1-500000",
               "500000-1000000",
               "1000000-1014000")) +
   labs(title = "Number of members and rating", 
       x = "Members", 
       y = "Rating") +
  theme_clean(base_size = 12, 
              base_family = "sans")

Here, too, the animes have been divided into three groups:

  1. Member count up to 500000

  2. Member count from 500000 to 1000000

  3. Member count between 1000000 and 1014000 (1013917 is the maximum)

In each group most of the animes received ratings above 5. Also, there are more animes with ratings starting at 5 in the group with up to 500000 members than in the others. So the number of members does not appear to be correlated with the user rating.
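
The violin plots give a visual impression only. One way to put a number on the relationship, not used in the original analysis, is a rank correlation computed with base R's `cor()`. The sketch below uses invented values; on the real data the same call could be applied to the episodes or members column together with user_rating.

```r
# Toy data, invented for illustration only
episodes    <- c(1, 12, 24, 50, 100, 500)
user_rating <- c(7, 8, 6, 9, 7, 8)

# Spearman rank correlation: works on ranks, so it is less sensitive to the
# heavy skew of episode counts; use = "complete.obs" ignores missing values
rho <- cor(episodes, user_rating,
           method = "spearman",
           use = "complete.obs")

rho
```

A value near 0 would support the reading that there is little monotonic relationship, while values near -1 or 1 would contradict it.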

Conclusion, limitations and notes

As mentioned above, the genres of animes are quite complicated. In the data set provided, most animes fall under Hentai, Comedy and Music. But the combinations of these genres differ: when combinations are taken into account, the Comedy, Action, Drama, Adventure and Sci-Fi genres have more animes.

Every format is popular, and animes in all formats received almost equal ratings from the users.

There is no apparent relationship between episode count or number of members and user rating.

The major limitation of the data set is that it does not contain information about manga in paper format.

Notes:

This analysis was written as a Quarto markdown document. The reader should have at least a basic understanding of the R programming language, the pipe operator, visualization with ggplot2, and the other functions used in the analysis.