# Load packages
library(tidyverse)
library(psych)
library(ggthemes)
library(e1071)
library(pacman)
Anime Data Project
INTRODUCTION
When I worked at a bookstore, young people, especially teenagers, used to come in and ask about animes. Some of them had watched the TV episodes and now wanted to try the book versions. Personally, I was not a fan of animes, but over time my attitude towards them began to shift. So, when I found this data set on Kaggle, I decided to use it to better understand anime as an artistic genre and, if possible, to answer some general questions, such as what attracts teenagers to this genre, and some specific ones, such as which topics and sub-genres (crime, horror, sci-fi and so on) are the most popular. As the data set was explored in depth, more questions arose, and I tried to write them down and answer them as well.
As mentioned above, the data set was downloaded from the data set section of the Kaggle website. According to its description, it contains “information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings”.
There are two CSV files in the data set:
anime.csv
rating.csv
The anime.csv file has the following variables:
anime_id - myanimelist.net’s unique id identifying an anime.
name - full name of anime.
genre - comma separated list of genres for this anime.
type - movie, TV, OVA, etc.
episodes - how many episodes in this show. (1 if movie).
rating - average rating out of 10 for this anime.
members - number of community members that are in this anime’s “group”.
The rating.csv file has three variables:
user_id - non identifiable randomly generated user id.
anime_id - the anime that this user has rated.
rating - rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating).
Some of the terms related to the context:
OVA - original video animation
ONA - original net animation
Special - (aka TV Special) is not broadcast weekly; it is usually yearly or a one-shot. It has only one episode but runs longer (e.g. 2 hours). It is still intended for broadcast, so it needs to meet the broadcast code.
ASK - QUESTIONS TO ANSWER
These were the questions considered the most important to answer and get insights about:
- What are the most popular genres for animes (by number of animes represented)?
- What are the most popular types for animes (by number of animes represented)?
- What are the maximum and minimum ratings that the specific genres received and which genres have the highest user rating?
- Which type of anime got the highest ratings?
- What is the maximum number of episodes, and which anime has it?
- Is there any relationship between number of episodes and rating?
- Is there any relationship between the number of members and rating?
PREPARE - LOADING THE DATA SET TO R, EDA
To load anime.csv into R, the `read_csv` function from the `readr` package was used, which is part of the `tidyverse` collection of packages; rating.csv was read with base R's `read.csv`. The `psych`, `ggthemes` and `e1071` packages were also loaded for auxiliary work on the data frames:
# Load packages needed for wordcloud creation
pacman::p_load("tm",
"SnowballC",
"wordcloud",
"RColorBrewer",
"RCurl",
"XML")
# Load data sets into R
anime <- read_csv(file = "anime.csv",
na = c("", "Unknown"))
Rows: 12294 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): name, genre, type
dbl (4): anime_id, episodes, rating, members
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rating <- read.csv(file = "rating.csv")
EDA
After loading data sets, different R functions were applied to the whole data frames and/or specific variables to better understand the information they contain.
Exploration of Anime data set
# View first few lines to know what kind of information a data frame contains
head(anime)
# A tibble: 6 × 7
anime_id name genre type episodes rating members
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630
2 5114 Fullmetal Alchemist: Brotherhood Acti… TV 64 9.26 793665
3 28977 Gintama° Acti… TV 51 9.25 114262
4 9253 Steins;Gate Sci-… TV 24 9.17 673572
5 9969 Gintama' Acti… TV 51 9.16 151266
6 32935 Haikyuu!!: Karasuno Koukou VS Sh… Come… TV 10 9.15 93351
# Get summary statistics of the data frame
summary(anime)
anime_id name genre type
Min. : 1 Length:12294 Length:12294 Length:12294
1st Qu.: 3484 Class :character Class :character Class :character
Median :10260 Mode :character Mode :character Mode :character
Mean :14058
3rd Qu.:24794
Max. :34527
episodes rating members
Min. : 1.00 Min. : 1.670 Min. : 5
1st Qu.: 1.00 1st Qu.: 5.880 1st Qu.: 225
Median : 2.00 Median : 6.570 Median : 1550
Mean : 12.38 Mean : 6.474 Mean : 18071
3rd Qu.: 12.00 3rd Qu.: 7.180 3rd Qu.: 9437
Max. :1818.00 Max. :10.000 Max. :1013917
NA's :340 NA's :230
# Check the data types of each column
str(anime)
spc_tbl_ [12,294 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ anime_id: num [1:12294] 32281 5114 28977 9253 9969 ...
$ name : chr [1:12294] "Kimi no Na wa." "Fullmetal Alchemist: Brotherhood" "Gintama°" "Steins;Gate" ...
$ genre : chr [1:12294] "Drama, Romance, School, Supernatural" "Action, Adventure, Drama, Fantasy, Magic, Military, Shounen" "Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen" "Sci-Fi, Thriller" ...
$ type : chr [1:12294] "Movie" "TV" "TV" "TV" ...
$ episodes: num [1:12294] 1 64 51 24 51 10 148 110 1 13 ...
$ rating : num [1:12294] 9.37 9.26 9.25 9.17 9.16 9.15 9.13 9.11 9.1 9.11 ...
$ members : num [1:12294] 200630 793665 114262 673572 151266 ...
- attr(*, "spec")=
.. cols(
.. anime_id = col_double(),
.. name = col_character(),
.. genre = col_character(),
.. type = col_character(),
.. episodes = col_double(),
.. rating = col_double(),
.. members = col_double()
.. )
- attr(*, "problems")=<externalptr>
# number of columns and rows
ncol(anime)
[1] 7
nrow(anime)
[1] 12294
# Names of columns
colnames(anime)
[1] "anime_id" "name" "genre" "type" "episodes" "rating" "members"
Genre
One of the most interesting variables in the anime data set is the genre column. It is also quite complicated, because an anime rarely belongs to a single genre. Instead, one anime usually has several genres, for instance ‘Action, Adventure, Comedy’ or ‘Romance, Fantasy, Ecchi, Sci-Fi’. The unique combinations were saved in a vector:
anime_genres <- unique(anime$genre)
To get the individual genre names that make up these combinations, the genre column was split on the comma separator into single genre names, and the unique names were then combined into a vector:
# Split the genre column into a list of genres for each row
genre_lists <- strsplit(anime$genre, ", ")
# Combine all unique genres into one vector
unique_genres <- unique(unlist(genre_lists))
# Print the unique genres
unique_genres
[1] "Drama" "Romance" "School" "Supernatural"
[5] "Action" "Adventure" "Fantasy" "Magic"
[9] "Military" "Shounen" "Comedy" "Historical"
[13] "Parody" "Samurai" "Sci-Fi" "Thriller"
[17] "Sports" "Super Power" "Space" "Slice of Life"
[21] "Mecha" "Music" "Mystery" "Seinen"
[25] "Martial Arts" "Vampire" "Shoujo" "Horror"
[29] "Police" "Psychological" "Demons" "Ecchi"
[33] "Josei" "Shounen Ai" "Game" "Dementia"
[37] "Harem" "Cars" "Kids" "Shoujo Ai"
[41] NA "Hentai" "Yaoi" "Yuri"
# Get the length of the new vector which is the count of the real unique genres
length(unique_genres)
[1] 44
The vector has 44 elements, but one of them is NA (animes without any genre), so the number of distinct genres is actually 43. As the data set shows, it is difficult to assign an anime to a single genre, so the genre column consists of combinations of these 43 genres. Of course, there are also animes in the original data set that have just one genre.
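The NA entry in the printed vector comes from animes without a genre: `strsplit()` returns NA for a missing string, and that NA is then counted by `length()`. A minimal sketch on a made-up three-element vector (the toy values are assumptions for illustration, not rows from the data set) shows one way to leave it out of the count:

```r
# Toy genre strings (assumed for illustration); NA stands for a missing genre
toy_genre <- c("Action, Comedy", NA, "Comedy, Drama")

# strsplit() returns NA for the missing entry, so it survives unlist()
toy_split <- unlist(strsplit(toy_genre, ", "))

# Unique values with the NA entry still included
with_na <- unique(toy_split)

# Unique values after dropping NA, i.e. the real genre names only
without_na <- unique(toy_split[!is.na(toy_split)])

length(with_na)
length(without_na)
```

The two lengths differ by exactly one, which is the NA entry.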
To see how often each single genre appears within the combinations, a frequency table and a word cloud were considered useful:
script <- "wordcloud.R"
source(script)
word_cloud <- rquery.wordcloud(anime_genres,
type ="text",
lang = "english",
textStemming=FALSE,
min.freq=1,
max.words=2000)
freq_table <- word_cloud$freqTable
freq_table
word freq
comedy comedy 1288
action action 1194
drama drama 848
adventure adventure 835
scifi scifi 832
fantasy fantasy 814
romance romance 768
shounen shounen 707
supernatural supernatural 603
school school 483
magic magic 377
mecha mecha 365
historical historical 339
ecchi ecchi 335
shoujo shoujo 326
mystery mystery 295
life life 293
slice slice 293
kids kids 277
seinen seinen 262
power power 249
super super 249
horror horror 233
military military 216
demons demons 188
music music 177
harem harem 177
space space 174
arts arts 171
martial martial 171
psychological psychological 162
parody parody 161
sports sports 157
hentai hentai 136
police police 104
samurai samurai 95
game game 95
vampire vampire 74
thriller thriller 66
dementia dementia 53
josei josei 34
cars cars 34
yaoi yaoi 21
yuri yuri 18
So, the genres and topics that appear most often in the combinations are comedy, action, drama, adventure and so on.
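Two properties of the frequency table are worth keeping in mind. It counts single words, so multi-word genres such as Slice of Life are split into separate rows (slice, life), and it was fed the unique combinations, so it counts combinations rather than animes. As a hedged alternative, per-genre counts over all rows can be sketched in base R. The toy vector below is an assumption for illustration only:

```r
# Toy genre column (assumed values); NA stands for an anime without a genre
toy_genre <- c("Comedy, Slice of Life", "Comedy", "Action, Comedy", NA)

# Split on the separator, flatten, and count; table() ignores NA by default
toy_counts <- sort(table(unlist(strsplit(toy_genre, ", "))),
                   decreasing = TRUE)

toy_counts
```

Because the split happens on ", " rather than on whitespace, "Slice of Life" stays a single genre in the result.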
Another way to understand this column is to group and summarize the animes by genre:
genres_group <- anime |>
group_by(genre) |>
summarise(anime_count = n()) |>
arrange(desc(anime_count))
genres_group
# A tibble: 3,265 × 2
genre anime_count
<chr> <int>
1 Hentai 823
2 Comedy 523
3 Music 301
4 Kids 199
5 Comedy, Slice of Life 179
6 Dementia 137
7 Fantasy, Kids 128
8 Fantasy 114
9 Comedy, Kids 112
10 Drama 107
# ℹ 3,255 more rows
# Summary statistic of the number of animes related to the genres
describe(genres_group$anime_count)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 3265 3.77 20.03 1 1.67 0 1 823 822 28.38 1015.21 0.35
The grouped data was then filtered to keep only the genres attached to 50 or more animes:
# Create a filtered version of grouped data
genres_group_50 <- genres_group |>
filter(anime_count >= 50)
# Create a bar chart to visualize genre and their counts based on grouped data
ggplot(data = genres_group_50,
aes(x = as_factor(genre),
y = anime_count)) +
geom_bar(stat = "identity",
fill = "blue") +
coord_flip() +
labs(x = "Genre",
y = "Number of animes",
title = "Anime Genres and their representation",
fill = "Count") +
theme_clean(base_size = 12,
base_family = "sans")
Both the grouped data and the visualization show that there are 62 animes with no genre recorded. This problem needs to be addressed in the data processing phase of the analysis.
# Presentation of the number of animes
ggplot(genres_group_50, aes(x = anime_count)) +
geom_boxplot(outlier.colour = "red",
outlier.shape = 4) +
theme_clean(base_size = 12,
base_family = "sans") +
labs(title = "Distribution of animes by genre",
x = "Number of animes")
At this point of the analysis there is already an answer to the first question. Even though comedy is the genre mentioned most often in combinations, Hentai is the single genre value with the most animes attached to it in the data set (823), followed by Comedy (523) and Music (301). So, this is the answer to the first question asked:
- What are the most popular genres for animes (by number of animes represented)?
Type
Another variable that contains valuable information about the animes is the `type` variable. It represents the media format in which an anime was released, such as TV series, movie, OVA or ONA:
unique(anime$type)
[1] "Movie" "TV" "OVA" "Special" "Music" "ONA" NA
# Grouping based on types
types_group <- anime |>
group_by(type) |>
summarise(anime_count = n()) |>
arrange(desc(anime_count))
types_group
# A tibble: 7 × 2
type anime_count
<chr> <int>
1 TV 3787
2 OVA 3311
3 Movie 2348
4 Special 1676
5 ONA 659
6 Music 488
7 <NA> 25
Grouping reveals that most of the animes were released in TV format. OVAs and movies are the second and third most common types. There are also 25 animes without any type value:
anime |>
filter(is.na(type))
# A tibble: 25 × 7
anime_id name genre type episodes rating members
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 30484 Steins;Gate 0 Sci-… <NA> NA NA 60999
2 34437 Code Geass: Fukkatsu no Lelouch Acti… <NA> NA NA 22748
3 33352 Violet Evergarden Dram… <NA> NA NA 20564
4 33248 K: Seven Stories Acti… <NA> NA NA 22133
5 33845 Free! (Shinsaku) Scho… <NA> NA NA 8666
6 33475 Busou Shoujo Machiavellianism Acti… <NA> NA NA 1896
7 31456 Code:Realize: Sousei no Himegimi Adve… <NA> NA NA 4017
8 34332 Flying Babies <NA> <NA> NA NA 22
9 34280 Gamers! Come… <NA> NA NA 1045
10 34485 Ganko-chan <NA> <NA> NA NA 11
# ℹ 15 more rows
Type, number of episodes and rating information were missing for these 25 animes. As they make up a very small portion of the data, filtering them out was considered the best approach:
types_group <- anime |>
filter(!is.na(type)) |>
group_by(type) |>
summarise(anime_count = n()) |>
arrange(desc(anime_count))
types_group
# A tibble: 6 × 2
type anime_count
<chr> <int>
1 TV 3787
2 OVA 3311
3 Movie 2348
4 Special 1676
5 ONA 659
6 Music 488
# Create a bar chart to visualize types and their counts based on grouped data
ggplot(data = types_group,
aes(x = as_factor(type),
y = anime_count)) +
geom_bar(stat = "identity",
fill ="blue") +
labs(x = "Type",
y = "Number of animes",
title = "Anime types and their representation") +
theme_clean(base_size = 12,
base_family = "sans")
# Presentation of the number of animes
ggplot(types_group, aes(x = anime_count)) +
geom_boxplot(outlier.colour = "red",
outlier.shape = 4) +
theme_clean(base_size = 12,
base_family = "sans") +
labs(title = "Distribution of animes by types",
x = "Number of animes")
The number of animes per type is spread far more evenly than the number per genre, although with only six types the box plot says little about the shape of the distribution.
Episodes
In the raw file the episodes column was read as character, which makes it less useful for analysis. To perform mathematical operations on this variable, its data type needed to be converted to double. The outputs in this section come from the final script, in which the data was re-imported with “Unknown” treated as NA, so they already report type double. First, the column was checked for NA values:
# Check data type of episodes
typeof(anime$episodes)
[1] "double"
# Check numbers of NAs
sum(is.na(anime$episodes))
[1] 340
In its raw form the column contained no NA values; the 340 NAs counted above are the “Unknown” entries that the final import already converts. Now, it is time to view the unique values of the variable:
unique(anime$episodes)
[1] 1 64 51 24 10 148 110 13 201 25 22 75 4 26 12
[16] 27 43 74 37 2 11 99 NA 39 101 47 50 62 33 112
[31] 23 3 94 6 8 14 7 40 15 203 77 291 120 102 96
[46] 38 79 175 103 70 153 45 5 21 63 52 28 145 36 69
[61] 60 178 114 35 61 34 109 20 9 49 366 97 48 78 358
[76] 155 104 113 54 167 161 42 142 31 373 220 46 195 17 1787
[91] 73 147 127 16 19 98 150 76 53 124 29 115 224 44 58
[106] 93 154 92 67 172 86 30 276 59 72 330 41 105 128 137
[121] 56 55 65 243 193 18 191 180 91 192 66 182 32 164 100
[136] 296 694 95 68 117 151 130 87 170 119 84 108 156 140 331
[151] 305 300 510 200 88 1471 526 143 726 136 1818 237 1428 365 163
[166] 283 71 260 199 225 312 240 1306 1565 773 1274 90 475 263 83
[181] 85 1006 80 162 132 141 125
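In the raw data it also contained the string “Unknown”, which would be converted to NA if the as.double() function were applied to change the data type.
# On the raw character column this counts the "Unknown" strings; after the
# re-import they are already NA, so the comparison returns NA
total_unknown <- sum(anime$episodes == "Unknown")
total_unknown
[1] NA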
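The NA result follows from how R compares missing values: any comparison with NA is itself NA, and `sum()` then propagates it. A minimal sketch on a toy character vector (the values are assumptions for illustration) contrasts counting the string before conversion with counting NA afterwards:

```r
# Toy raw column as it might look before conversion (assumed values)
raw_episodes <- c("1", "Unknown", "12", "Unknown")

# On the raw strings the comparison works and counts the placeholder
n_raw <- sum(raw_episodes == "Unknown")

# Mark the placeholder as missing, as the na argument of the import does
clean_episodes <- replace(raw_episodes, raw_episodes == "Unknown", NA)

# Comparing NA with a string yields NA, so the sum is NA as well
n_after <- sum(clean_episodes == "Unknown")

# is.na() is the reliable way to count missing values
n_na <- sum(is.na(clean_episodes))

# The remaining strings convert to numbers without warnings
episodes_num <- as.double(clean_episodes)
```

Here `n_raw` and `n_na` agree, while `n_after` is NA, which mirrors the output shown above.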
The data frame was removed and imported again so that “Unknown” values are read as NA. This was done in the import script above, and after importing, the type conversion was carried out:
# Convert episodes to the number type
anime$episodes <- as.double(anime$episodes)
So, the 340 “Unknown” values are now NA, which is R’s way of denoting missing values. The conversion completed successfully, and statistical functions can now be applied to this variable:
# Minimum number of episodes
min(anime$episodes, na.rm = TRUE)
[1] 1
# Maximum number of episodes
max(anime$episodes, na.rm = TRUE)
[1] 1818
# Descriptive statistics
describe(anime$episodes)
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 11954 12.38 46.87 2 5.96 1.48 1 1818 1817 23.38 732.85
se
X1 0.43
# Grouping based on episodes, name, genre, type
episodes_group <- anime |>
group_by(episodes, name, genre, type) |>
summarise(
anime_count = n(),
.groups = "keep") |>
arrange(desc(episodes))
episodes_group
# A tibble: 12,294 × 5
# Groups: episodes, name, genre, type [12,294]
episodes name genre type anime_count
<dbl> <chr> <chr> <chr> <int>
1 1818 Oyako Club Comedy, Sli… TV 1
2 1787 Doraemon (1979) Adventure, … TV 1
3 1565 Kirin Monoshiri Yakata Kids TV 1
4 1471 Manga Nippon Mukashibanashi (1976) Fantasy, Hi… TV 1
5 1428 Hoka Hoka Kazoku Comedy TV 1
6 1306 Kirin Ashita no Calendar Historical,… TV 1
7 1274 Monoshiri Daigaku: Ashita no Calendar Historical TV 1
8 1006 Sekai Monoshiri Ryoko Comedy TV 1
9 773 Kotowaza House Comedy, Sli… TV 1
10 726 Shima Shima Tora no Shimajirou Adventure, … TV 1
# ℹ 12,284 more rows
# filter animes with episodes count greater than 100
episodes_group |>
filter(episodes > 100) |>
# create visualization based on this filter to see the distribution of episodes count
ggplot(aes(x = episodes)) +
geom_histogram(binwidth = 50,
fill = "blue") +
labs(
title = "Distribution of episodes",
x = "Number of episodes",
y = "Frequency"
) +
theme_clean(base_size = 12,
base_family = "sans")
Only a few animes have 1,000 or more episodes. At least half of the animes in this data set have 1 or 2 episodes (the median is 2). The episode count of series ranges between 2 and 1818. So, the maximum number of episodes is 1818, which belongs to the anime “Oyako Club” in the genre “Comedy, Slice of Life”. This is the answer to our fifth question:
- What is the maximum number of episodes, and which anime has it?
Rating
This column holds the average rating each anime received from users. The individual user ratings are collected in the rating.csv file, which is joined to the anime data frame later in the analysis. It is still an interesting column, so its observations were explored in this part too:
describe(anime$rating)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 12064 6.47 1.03 6.57 6.53 0.96 1.67 10 8.33 -0.54 0.51 0.01
# Presentation of ratings
ggplot(anime, aes(x = rating)) +
geom_boxplot(outlier.colour = "red",
outlier.shape = 4) +
theme_clean(base_size = 12,
base_family = "sans") +
labs(title = "Average ratings of animes",
x = "Average rating")
Half of the animes have ratings between 5.88 and 7.18 (median 6.57). There are outliers at both the low and the high end of this variable. The distribution of the rating is left-skewed:
skewness(anime$rating, na.rm = TRUE)
[1] -0.5434349
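To make the sign of the skewness concrete, the classical moment coefficient (third central moment divided by the 1.5 power of the second) can be sketched in a few lines of base R. This is an illustrative sketch, not the exact formula behind the value above: packaged implementations such as the one in `e1071` may apply small-sample corrections, so results can differ slightly. The helper name `moment_skewness` and the toy vectors are assumptions.

```r
# Hypothetical helper: moment coefficient of skewness, ignoring NA values
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Toy ratings with a long tail towards low values (assumed data)
left_tailed <- c(1, 6, 7, 7, 8, 8, 8)

# Toy values that are symmetric around their mean (assumed data)
symmetric <- c(1, 2, 3, 4, 5)

skew_left <- moment_skewness(left_tailed)
skew_sym <- moment_skewness(symmetric)
```

A negative value, as for `left_tailed`, means the tail extends towards low ratings, which is the pattern the rating column shows.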
ggplot(anime, aes(x = rating)) +
geom_histogram(binwidth = 0.5,
fill = "blue",
color = "black") +
labs(title = "Distribution of rating",
x = "Rating",
y = "Frequency") +
theme_clean(base_size = 12,
base_family = "sans")
Members
This variable is about the online groups of animes. It represents how many members the group dedicated to a single anime has:
describe(anime$members)
vars n mean sd median trimmed mad min max range skew
X1 1 12294 18071.34 54820.68 1550 5589.94 2172.01 5 1013917 1013912 6.68
kurtosis se
X1 62.82 494.42
So, this variable is also full of outliers and oddities: its minimum member count is 5 and its maximum is 1013917. Filtering out the smaller member groups was thought to be a better way to work with this variable.
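A box plot flags as outliers the points that lie more than 1.5 times the hinge spread beyond the hinges. A minimal sketch with base R's `boxplot.stats()` shows which values would be marked. Only 5 and 1013917 are taken from the summary above; the remaining toy values are assumptions for illustration.

```r
# Toy member counts: minimum and maximum from the summary, the rest assumed
toy_members <- c(5, 120, 300, 1500, 2200, 9000, 1013917)

# boxplot.stats() returns the five-number summary and the flagged outliers
member_stats <- boxplot.stats(toy_members)

member_stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
member_stats$out    # values beyond 1.5 times the hinge spread
```

With a distribution this skewed, only the large groups are flagged, while very small groups such as 5 members stay inside the whiskers.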
ggplot(anime,
aes(x = members)) +
geom_boxplot(outlier.colour = "red",
outlier.shape = 4) +
theme_clean(base_size = 12,
base_family = "sans") +
labs(title = "Member groups and the member count of them",
x = "Members")
Exploration of rating data set
The rating data frame has only three columns. The most important ones are anime_id, which is used to join it with the anime data frame, and rating, which is the rating each user gave to a specific anime.
# View first few lines to know what kind of information a data frame contains
head(rating)
user_id anime_id rating
1 1 20 -1
2 1 24 -1
3 1 79 -1
4 1 226 -1
5 1 241 -1
6 1 355 -1
# Get summary statistics of the dataset
summary(rating)
user_id anime_id rating
Min. : 1 Min. : 1 Min. :-1.000
1st Qu.:18974 1st Qu.: 1240 1st Qu.: 6.000
Median :36791 Median : 6213 Median : 7.000
Mean :36728 Mean : 8909 Mean : 6.144
3rd Qu.:54757 3rd Qu.:14093 3rd Qu.: 9.000
Max. :73516 Max. :34519 Max. :10.000
As noted in the Introduction, a rating of -1 means no rating. It caused confusion in the analysis, so it was converted to a more “useful” value in the data processing section.
# Presentation of ratings
# Confusing visualization: -1 is also displayed as a legal value
ggplot(rating, aes(x = rating)) +
geom_boxplot(outlier.colour = "red",
outlier.shape = 4) +
theme_clean(base_size = 12,
base_family = "sans") +
labs(title = "Ratings of animes",
subtitle = "Ratings are integers from 1 to 10",
x = "Rating")
# Check the data types of each column
str(rating)
'data.frame': 7813737 obs. of 3 variables:
$ user_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ anime_id: int 20 24 79 226 241 355 356 442 487 846 ...
$ rating : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
# number of columns and rows
ncol(rating)
[1] 3
nrow(rating)
[1] 7813737
# Names of columns
colnames(rating)
[1] "user_id" "anime_id" "rating"
Joined data frame
After exploring two data sets separately, they were joined based on anime_id column:
animes_joined <- inner_join(anime,
rating,
by = "anime_id")
head(animes_joined)
# A tibble: 6 × 9
anime_id name genre type episodes rating.x members user_id rating.y
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <int> <int>
1 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630 99 5
2 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630 152 10
3 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630 244 10
4 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630 271 10
5 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630 278 -1
6 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630 322 10
str(animes_joined)
spc_tbl_ [7,813,727 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ anime_id: num [1:7813727] 32281 32281 32281 32281 32281 ...
$ name : chr [1:7813727] "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." ...
$ genre : chr [1:7813727] "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" ...
$ type : chr [1:7813727] "Movie" "Movie" "Movie" "Movie" ...
$ episodes: num [1:7813727] 1 1 1 1 1 1 1 1 1 1 ...
$ rating.x: num [1:7813727] 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 ...
$ members : num [1:7813727] 200630 200630 200630 200630 200630 ...
$ user_id : int [1:7813727] 99 152 244 271 278 322 398 462 490 548 ...
$ rating.y: int [1:7813727] 5 10 10 10 -1 10 10 8 10 10 ...
- attr(*, "spec")=
.. cols(
.. anime_id = col_double(),
.. name = col_character(),
.. genre = col_character(),
.. type = col_character(),
.. episodes = col_double(),
.. rating = col_double(),
.. members = col_double()
.. )
- attr(*, "problems")=<externalptr>
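The joined data frame has 7,813,727 rows, ten fewer than the rating data frame (7,813,737), which suggests that a few rated anime_id values have no match in anime.csv. A minimal sketch for locating such rows, using made-up toy frames (all values below are assumptions for illustration):

```r
# Toy stand-ins for the two data frames (assumed values)
toy_anime <- data.frame(anime_id = c(1, 2, 3),
                        name = c("A", "B", "C"))
toy_rating <- data.frame(user_id = c(10, 10, 11, 12),
                         anime_id = c(1, 2, 2, 99),
                         rating = c(8, -1, 7, 9))

# Ratings whose anime_id does not occur in the anime table
unmatched <- toy_rating[!toy_rating$anime_id %in% toy_anime$anime_id, ]

# merge() performs an inner join by default, like inner_join()
toy_joined <- merge(toy_anime, toy_rating, by = "anime_id")

# Rows lost in the join equal the number of unmatched ratings
rows_lost <- nrow(toy_rating) - nrow(toy_joined)
```

In a tidyverse workflow, `dplyr::anti_join(rating, anime, by = "anime_id")` would return the same unmatched rows.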
PROCESS - DATA CLEANING
Before this phase, some basic data cleaning had already been applied to the anime data frame separately:
- The data type of the `episodes` variable was converted from character to double.
The rating data frame, however, was joined as it was, so its rating variable carried its data type and values into the joined data frame, including all the -1 values. Such placeholder values can be treated differently depending on the type, purpose and reliability of the data, among other factors. In this analysis they were replaced with the average value of the rating.
Before that, one more problem with the joined data frame was obvious: the column names rating.x and rating.y were not user friendly, so they were replaced with descriptive ones:
# Rename rating.x, rating.y variables
animes_joined <- animes_joined |>
rename(
average_rating = rating.x,
user_rating = rating.y
)
# Check data frame after renaming
str(animes_joined)
spc_tbl_ [7,813,727 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ anime_id : num [1:7813727] 32281 32281 32281 32281 32281 ...
$ name : chr [1:7813727] "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." "Kimi no Na wa." ...
$ genre : chr [1:7813727] "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" "Drama, Romance, School, Supernatural" ...
$ type : chr [1:7813727] "Movie" "Movie" "Movie" "Movie" ...
$ episodes : num [1:7813727] 1 1 1 1 1 1 1 1 1 1 ...
$ average_rating: num [1:7813727] 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 9.37 ...
$ members : num [1:7813727] 200630 200630 200630 200630 200630 ...
$ user_id : int [1:7813727] 99 152 244 271 278 322 398 462 490 548 ...
$ user_rating : int [1:7813727] 5 10 10 10 -1 10 10 8 10 10 ...
- attr(*, "spec")=
.. cols(
.. anime_id = col_double(),
.. name = col_character(),
.. genre = col_character(),
.. type = col_character(),
.. episodes = col_double(),
.. rating = col_double(),
.. members = col_double()
.. )
- attr(*, "problems")=<externalptr>
Before replacing the -1 values with the average, the unique values in the user_rating variable were displayed. Then, to be able to check the effectiveness of the replacement, the total number of -1 ratings was counted so that the before and after results could be compared:
# Distinct values in user_rating column
unique(animes_joined$user_rating)
[1] 5 10 -1 8 9 7 6 2 4 1 3
# Total number of -1 values
sum(animes_joined$user_rating == -1)
[1] 1476488
# Replacement of -1 with mean value
# Note: the mean is taken over the whole column, -1 placeholders included
animes_joined$user_rating[animes_joined$user_rating == -1] <- round(
mean(animes_joined$user_rating,
na.rm = TRUE),
2
)
# Distinct values in user_rating column after the replacement
unique(animes_joined$user_rating)
[1] 5.00 10.00 6.14 8.00 9.00 7.00 6.00 2.00 4.00 1.00 3.00
# Total number of NAs after the replacement
sum(is.na(animes_joined$user_rating))
[1] 0
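The replacement value is the mean of the whole column, so the -1 placeholders themselves pull it down. A hedged alternative, sketched on a toy vector (the values are assumptions for illustration), is to mark -1 as NA first and average only the ratings that users actually gave. Whether to impute the missing ratings afterwards, or simply leave them out, is a separate judgment call.

```r
# Toy user ratings (assumed values); -1 means watched but not rated
toy_rating <- c(10, -1, 8, -1, 6)

# Mean over the whole column, placeholders included
mean_with_placeholder <- mean(toy_rating)

# Mark the placeholder as missing before averaging
observed <- replace(toy_rating, toy_rating == -1, NA)

# Mean over the ratings that were actually given
mean_observed <- mean(observed, na.rm = TRUE)
```

On the toy data the first mean is 4.4 and the second is 8, which shows how strongly the placeholder can bias the replacement value.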
After this replacement, the box plot was recreated to check whether there were still other anomalies in this column:
ggplot(animes_joined,
aes(x = user_rating)) +
geom_boxplot(outlier.colour = "red",
outlier.shape = 4) +
theme_clean(base_size = 12,
base_family = "sans") +
labs(title = "Ratings of animes",
subtitle = "Ratings are integers from 1 to 10",
x = "Rating")
ANALYSIS
Ratings of the genres
To get a summary of the ratings by genre, the joined data frame was grouped:
genres_summary <- animes_joined |>
group_by(genre) |>
summarise(
avg_rating = round(
mean(user_rating,
na.rm = TRUE),
2
),
min_rating = min(user_rating,
na.rm = TRUE),
max_rating = max(user_rating,
na.rm = TRUE)
) |>
arrange(desc(avg_rating))
genres_summary
# A tibble: 3,155 × 4
genre avg_rating min_rating max_rating
<chr> <dbl> <dbl> <dbl>
1 Action, Historical, Kids 10 10 10
2 Action, Adventure, Drama, Fantasy, Magic, M… 8.92 1 10
3 Drama, Fantasy, Romance, Slice of Life, Sup… 8.77 1 10
4 Action, Drama, Mecha, Military, Sci-Fi, Sup… 8.68 1 10
5 Action, Comedy, Historical, Parody, Samurai… 8.66 1 10
6 Drama, Music, Romance, School, Shounen 8.63 1 10
7 Sci-Fi, Thriller 8.62 1 10
8 Drama, Horror, Mystery, Police, Psychologic… 8.61 1 10
9 Drama, Romance, School, Supernatural 8.61 1 10
10 Action, Mecha, Military, School, Sci-Fi, Su… 8.57 1 10
# ℹ 3,145 more rows
The animes in the Action, Historical, Kids genre revealed an interesting result: both their minimum and maximum ratings are 10 out of 10. The data was filtered to show the animes in this genre:
animes_joined |>
filter(genre == "Action, Historical, Kids")
# A tibble: 1 × 9
anime_id name genre type episodes average_rating members user_id user_rating
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <int> <dbl>
1 33484 Shir… Acti… Movie 1 4.71 45 69497 10
Only one anime in this genre was represented, and it got a 10 from the only user who rated it. A single rating is not a reliable basis for an average, so filtering out this genre was considered the better approach:
genres_summary <- animes_joined |>
filter(genre != "Action, Historical, Kids") |>
group_by(genre) |>
summarise(
avg_rating = round(
mean(
user_rating,
na.rm = TRUE),
2
),
min_rating = min(user_rating,
na.rm = TRUE),
max_rating = max(user_rating,
na.rm = TRUE)
) |>
arrange(genre)
# arrange(desc(avg_rating))
genres_summary
# A tibble: 3,153 × 4
genre avg_rating min_rating max_rating
<chr> <dbl> <dbl> <dbl>
1 Action 6.69 1 10
2 Action, Adventure 7.01 1 10
3 Action, Adventure, Cars, Comedy, Sci-Fi, Sh… 6.89 2 10
4 Action, Adventure, Cars, Mecha, Sci-Fi, Sho… 5.98 1 10
5 Action, Adventure, Cars, Sci-Fi 6.75 1 10
6 Action, Adventure, Comedy 6.8 1 10
7 Action, Adventure, Comedy, Demons, Drama, E… 7.04 3 10
8 Action, Adventure, Comedy, Demons, Fantasy,… 7.19 2 10
9 Action, Adventure, Comedy, Demons, Fantasy,… 7.79 1 10
10 Action, Adventure, Comedy, Demons, Fantasy,… 6.91 1 10
# ℹ 3,143 more rows
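So, even though Hentai has the most animes, the highest average user rating in the first summary above (8.92, once the single-rating genre is set aside) went to a combination led by Action, and combinations beginning with Action or Drama dominate the top of that ranking. Most combinations shown span the full range of ratings from 1 to 10. This answers the third question of the analysis:
- What are the maximum and minimum ratings that the specific genres received and which genres have the highest user rating?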
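Rather than excluding one suspicious genre by name, a more general option is to require a minimum number of ratings per genre before ranking. The sketch below uses base R on a toy data frame; the threshold `min_ratings` and all values are assumptions for illustration, not settings used in this analysis.

```r
# Toy joined data (assumed values): genre B has a single rating of 10
toy <- data.frame(
  genre = c("A", "A", "A", "B", "C", "C"),
  user_rating = c(8, 7, 9, 10, 6, 8)
)

# Average rating and number of ratings per genre
avg_rating <- tapply(toy$user_rating, toy$genre, mean)
n_ratings <- tapply(toy$user_rating, toy$genre, length)

genre_summary <- data.frame(
  genre = names(avg_rating),
  avg_rating = as.vector(avg_rating),
  n_ratings = as.vector(n_ratings)
)

# Hypothetical threshold: keep genres with at least this many ratings
min_ratings <- 2
reliable <- genre_summary[genre_summary$n_ratings >= min_ratings, ]

# Rank the remaining genres by average rating, highest first
reliable <- reliable[order(-reliable$avg_rating), ]
```

Genre B, whose perfect score rests on one rating, drops out without being named explicitly.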
Ratings of the types
As animes are released in different media formats, the ratings by format were summarized and visualized:
types_summary <-
animes_joined |>
group_by(type) |>
summarise(
avg_rating = round(
mean(
user_rating,
na.rm = TRUE),
2
),
min_rating = min(user_rating,
na.rm = TRUE),
max_rating = max(user_rating,
na.rm = TRUE)
) |>
arrange(desc(avg_rating))
types_summary
# A tibble: 7 × 4
type avg_rating min_rating max_rating
<chr> <dbl> <dbl> <dbl>
1 <NA> 8.5 7 9
2 TV 7.59 1 10
3 Movie 7.57 1 10
4 Special 7.14 1 10
5 OVA 7.06 1 10
6 ONA 6.99 1 10
7 Music 6.95 1 10
Again the unknown type was present here, and it got the highest average rating. Filtering revealed that it was the anime Steins;Gate 0, seen earlier with missing type, episodes and average rating; it received ratings between 7 and 9 from only 5 users. It was also filtered out:
types_summary <-
animes_joined |>
filter(!is.na(type)) |>
group_by(type) |>
summarise(
avg_rating = round(
mean(user_rating,
na.rm = TRUE),
2
),
min_rating = min(user_rating,
na.rm = TRUE),
max_rating = max(user_rating,
na.rm = TRUE)
) |>
arrange(desc(avg_rating))
types_summary
# A tibble: 6 × 4
type avg_rating min_rating max_rating
<chr> <dbl> <dbl> <dbl>
1 TV 7.59 1 10
2 Movie 7.57 1 10
3 Special 7.14 1 10
4 OVA 7.06 1 10
5 ONA 6.99 1 10
6 Music 6.95 1 10
# Highest rated formats for animes
ggplot(data = types_summary,
aes(x = as_factor(type),
y = avg_rating)) +
geom_bar(stat = "identity",
fill ="blue") +
labs(x = "Type",
y = "Average rating",
title = "Anime types and average rating") +
theme_clean(base_size = 12,
base_family = "sans")
So, apart from small differences, all formats receive similar average user ratings (between 6.95 and 7.59), with TV (7.59) and Movie (7.57) at the top. This is the answer to the fourth question:
- Which type of anime got the highest ratings?
Relationship between the number of episodes and rating
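# Drop rows with an unknown episode count so NA does not appear as a group
animes_joined |>
  filter(!is.na(episodes)) |>
  ggplot(mapping = aes(
    # Four break points define the three groups; starting at 0 keeps
    # single-episode animes, which an interval opening at 1 would exclude
    x = cut(episodes,
            breaks = c(0, 500, 1000, 2000),
            labels = c("1-500",
                       "501-1000",
                       "1001-1900")),
    y = user_rating)
  ) +
  geom_violin(fill = "red") +
  labs(title = "Episodes and rating",
       x = "Number of episodes",
       y = "Rating") +
  theme_clean(base_size = 12,
              base_family = "sans")
Here there are three groups: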
Animes with an episode count from 1 to 500
Animes with an episode count from 501 to 1000
Animes with an episode count from 1001 to 1900 (1818 is the maximum)
In each group most of the animes received ratings above 5, so the plot gives no visual evidence of a relationship between the number of episodes and the user rating.
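How `cut()` turns break points into groups is worth spelling out: n break points define n - 1 intervals, the intervals are closed on the right by default, and the lowest break itself is excluded. A minimal sketch on toy episode counts (the values and labels are assumptions for illustration):

```r
# Toy episode counts (assumed values), including both interval edges
toy_episodes <- c(1, 12, 500, 501, 1000, 1818)

# Four break points give three right-closed bins: (0,500], (500,1000], (1000,2000]
bins <- cut(toy_episodes,
            breaks = c(0, 500, 1000, 2000),
            labels = c("1-500", "501-1000", "1001-2000"))

bin_counts <- table(bins)
bin_counts

# A value equal to the lowest break falls outside every bin and becomes NA
edge_case <- cut(1, breaks = c(1, 1000, 2000))
```

Starting the breaks at 0, or passing `include.lowest = TRUE`, keeps single-episode titles in the first group.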
Relationship between the number of members and rating
ggplot(animes_joined, mapping =
aes(x = cut(members, breaks =
c(1,
500000,
1000000,
1014000)),
y = user_rating)) +
geom_violin(fill = "red") +
scale_x_discrete(
labels = c("1-500000",
"500000-1000000",
"1000000-1014000")) +
labs(title = "Number of members and rating",
x = "Members",
y = "Rating") +
theme_clean(base_size = 12,
base_family = "sans")
Here too, the animes have been divided into three groups:
Groups with a member count up to 500000
Member count from 500000 to 1000000
Member count from 1000000 to 1014000 (1013917 is the maximum)
In each group most of the animes received ratings above 5. Also, ratings around 5 are more common for animes whose groups have up to 500000 members than for the others. So the plot gives no visual evidence of a relationship between the number of members and the user rating.
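A violin plot gives only a visual impression. A rank correlation coefficient puts a number on the strength of a monotone relationship and is less sensitive to the extreme episode and member counts seen earlier. A minimal sketch with base R's `cor()` on a toy data frame (all values are assumptions for illustration, not results from the data set):

```r
# Toy anime-level data (assumed values); NA mimics an unknown episode count
toy_anime <- data.frame(
  episodes = c(1, 12, 24, 50, 100, NA),
  avg_rating = c(7.5, 6.0, 8.1, 6.4, 7.0, 6.8)
)

# Spearman rank correlation, using only rows where both values are present
rho <- cor(toy_anime$episodes, toy_anime$avg_rating,
           method = "spearman", use = "complete.obs")

# Sanity check: a strictly increasing relationship gives a coefficient of 1
rho_monotone <- cor(c(1, 2, 3, 4), c(10, 20, 40, 80), method = "spearman")
```

Values near 0 would support the reading that there is no monotone relationship, while values near -1 or 1 would contradict it. The same call works with members in place of episodes.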
Conclusion, limitations and notes
As mentioned above, the genres of animes are quite complicated. As single genre values, Hentai, Comedy and Music have the most animes in the data set. When the individual genres inside the combinations are counted instead, comedy, action, drama, adventure and sci-fi appear most often.
Every format of anime is popular, and animes in all formats received almost equal ratings from the users.
The plots show no clear relationship between user rating and either the episode count or the number of members.
The major limitation of the data set is that it does not contain information about mangas in paper format.
Notes:
This analysis was written as a Quarto markdown document. The reader should have at least a basic understanding of the R programming language, the pipe operator, visualization with ggplot2 and the other functions used in the analysis.