knitr::include_graphics("C:/Users/maddi/Downloads/chika.jpg")

Introduction

The Anime Recommendations Dataset, available on Kaggle, is a collection of anime titles and related information such as genre, rating, number of episodes, and number of members who have watched the show. It contains user preference data from 73,516 users on 12,294 anime. Each user can add anime to their completed list and give it a rating, and this data set is a compilation of those ratings.

The seven variables in anime.csv are anime_id, name, genre, type, episodes, rating, and members.

There is no information on how the data was collected or how the ratings were gathered. However, some of the data was collected from MyAnimeList.net, a popular anime and manga community that allows users to rate and review anime titles.

I chose this topic because I love to watch anime and it is a popular form of entertainment for many. Anime has a significant cultural impact on many around the globe.

Load the libraries

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(RColorBrewer)

anime <- read_csv("anime.csv")
## Rows: 12294 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, genre, type, episodes
## dbl (3): anime_id, rating, members
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning

str(anime)
## spc_tbl_ [12,294 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ anime_id: num [1:12294] 32281 5114 28977 9253 9969 ...
##  $ name    : chr [1:12294] "Kimi no Na wa." "Fullmetal Alchemist: Brotherhood" "Gintama°" "Steins;Gate" ...
##  $ genre   : chr [1:12294] "Drama, Romance, School, Supernatural" "Action, Adventure, Drama, Fantasy, Magic, Military, Shounen" "Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen" "Sci-Fi, Thriller" ...
##  $ type    : chr [1:12294] "Movie" "TV" "TV" "TV" ...
##  $ episodes: chr [1:12294] "1" "64" "51" "24" ...
##  $ rating  : num [1:12294] 9.37 9.26 9.25 9.17 9.16 9.15 9.13 9.11 9.1 9.11 ...
##  $ members : num [1:12294] 200630 793665 114262 673572 151266 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   anime_id = col_double(),
##   ..   name = col_character(),
##   ..   genre = col_character(),
##   ..   type = col_character(),
##   ..   episodes = col_character(),
##   ..   rating = col_double(),
##   ..   members = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(anime)
##     anime_id         name              genre               type          
##  Min.   :    1   Length:12294       Length:12294       Length:12294      
##  1st Qu.: 3484   Class :character   Class :character   Class :character  
##  Median :10260   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :14058                                                           
##  3rd Qu.:24795                                                           
##  Max.   :34527                                                           
##                                                                          
##    episodes             rating          members       
##  Length:12294       Min.   : 1.670   Min.   :      5  
##  Class :character   1st Qu.: 5.880   1st Qu.:    225  
##  Mode  :character   Median : 6.570   Median :   1550  
##                     Mean   : 6.474   Mean   :  18071  
##                     3rd Qu.: 7.180   3rd Qu.:   9437  
##                     Max.   :10.000   Max.   :1013917  
##                     NA's   :230
head(anime)
## # A tibble: 6 × 7
##   anime_id name                              genre type  episodes rating members
##      <dbl> <chr>                             <chr> <chr> <chr>     <dbl>   <dbl>
## 1    32281 Kimi no Na wa.                    Dram… Movie 1          9.37  200630
## 2     5114 Fullmetal Alchemist: Brotherhood  Acti… TV    64         9.26  793665
## 3    28977 Gintama°                          Acti… TV    51         9.25  114262
## 4     9253 Steins;Gate                       Sci-… TV    24         9.17  673572
## 5     9969 Gintama&#039;                     Acti… TV    51         9.16  151266
## 6    32935 Haikyuu!!: Karasuno Koukou VS Sh… Come… TV    10         9.15   93351
# read in the anime.csv dataset
anime <- read.csv("anime.csv", stringsAsFactors = FALSE)

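# replace any NA values in the "episodes" column with 0
# (episodes is still character at this point, so non-numeric entries are not caught here)
anime$episodes[is.na(anime$episodes)] <- 0

# convert the "episodes" column to numeric; non-numeric entries become NA
anime$episodes <- as.numeric(anime$episodes)
## Warning: NAs introduced by coercion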
# replace all NA values in the "rating" column with 0
anime$rating[is.na(anime$rating)] <- 0

# convert the "rating" column to numeric
anime$rating <- as.numeric(anime$rating)

# replace all NA values in the "members" column with 0
anime$members[is.na(anime$members)] <- 0

# convert the "members" column to numeric
anime$members <- as.numeric(anime$members)

# replace all NA values in the "genre" column with "Unknown"
anime$genre[is.na(anime$genre)] <- "Unknown"

# remove any leading or trailing whitespaces in the "genre" column
anime$genre <- trimws(anime$genre)

# remove any duplicates in the dataset
anime <- distinct(anime)

# check the structure of the cleaned dataset
str(anime)
## 'data.frame':    12294 obs. of  7 variables:
##  $ anime_id: int  32281 5114 28977 9253 9969 32935 11061 820 15335 15417 ...
##  $ name    : chr  "Kimi no Na wa." "Fullmetal Alchemist: Brotherhood" "Gintama°" "Steins;Gate" ...
##  $ genre   : chr  "Drama, Romance, School, Supernatural" "Action, Adventure, Drama, Fantasy, Magic, Military, Shounen" "Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen" "Sci-Fi, Thriller" ...
##  $ type    : chr  "Movie" "TV" "TV" "TV" ...
##  $ episodes: num  1 64 51 24 51 10 148 110 1 13 ...
##  $ rating  : num  9.37 9.26 9.25 9.17 9.16 9.15 9.13 9.11 9.1 9.11 ...
##  $ members : num  200630 793665 114262 673572 151266 ...
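
The coercion warning above indicates that some `episodes` entries are not numeric strings, so they end up as NA after the NA replacement has already run. A minimal sketch of one alternative order, shown on a small made-up data frame (the `toy` values and the "Unknown" placeholder are assumptions for illustration, not rows taken from anime.csv): convert first, then decide how to treat the missing values.

```r
# Toy data; values are made up for illustration only
toy <- data.frame(
  name     = c("A", "B", "C"),
  episodes = c("12", "Unknown", "24"),
  stringsAsFactors = FALSE
)

# Convert first: non-numeric strings become NA (the warning is expected here)
toy$episodes <- suppressWarnings(as.numeric(toy$episodes))

# Count how many values are missing after the conversion
n_missing <- sum(is.na(toy$episodes))

# Summaries can skip the missing values instead of treating them as 0
mean_episodes <- mean(toy$episodes, na.rm = TRUE)
```

Keeping the values as NA, rather than replacing them with 0, avoids pulling averages toward zero; whether that is the right choice depends on the analysis.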

Correlation matrix of the numeric variables in the data set

# Make sure the rating and members variables are numeric
anime$rating <- as.numeric(anime$rating)
anime$members <- as.numeric(anime$members)

# Calculate the correlation matrix. A correlation coefficient of 1 indicates a perfect positive correlation, 0 indicates no correlation, and -1 indicates a perfect negative correlation.
cor_matrix <- cor(anime %>% select_if(is.numeric))
cor_matrix
##             anime_id episodes     rating     members
## anime_id  1.00000000       NA -0.3559499 -0.08007118
## episodes          NA        1         NA          NA
## rating   -0.35594989       NA  1.0000000  0.31142103
## members  -0.08007118       NA  0.3114210  1.00000000
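
The NA entries in the `episodes` row and column appear because `cor()`, with its default settings, returns NA for any pair of columns in which a value is missing. A small sketch on made-up numbers (the `toy` data frame is hypothetical) of how the `use` argument of `cor()` can drop incomplete pairs instead:

```r
# Toy data with one missing episode count; values are made up
toy <- data.frame(
  episodes = c(1, 12, NA, 24, 50),
  rating   = c(7.5, 8.0, 6.5, 8.5, 9.0),
  members  = c(100, 500, 50, 900, 2000)
)

# Default behaviour: correlations involving a column with NA are NA
cor_default <- cor(toy)

# Pairwise deletion: each correlation uses only rows where both values exist
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
```

With pairwise deletion, different cells of the matrix may be based on different numbers of rows, which is worth keeping in mind when comparing them.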

Visualization 1: Distribution of Anime Ratings

# create histogram of anime ratings
ggplot(anime, aes(x = rating)) +
  geom_histogram(fill = "#DDA0DD", color = "#EE82EE") +
  labs(title = "Distribution of Anime Ratings",
       x = "Rating",
       y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(anime, aes(x = rating)) +
  geom_histogram(binwidth = 0.5, color = "#E6E6FA", fill = "#D8BFD8") +
  scale_x_continuous(breaks = seq(0, 10, 0.5)) +
  labs(title = "Distribution of Anime Ratings",
       x = "Rating",
       y = "Frequency",
       fill = "") +
  theme_minimal()

Visualization 2: Relationships of Ratings and Members

ggplot(anime, aes(x = rating, y = members)) +
  geom_point(alpha = 0.6, color = "#DA70D6") +
  scale_x_continuous(limits = c(0, 10)) +
  labs(title = "Relationship Between Anime Ratings and Members",
       x = "Rating",
       y = "Members",
       fill = "") +
  theme_minimal()

Visualization 3: Ratings by Anime Type

anime %>%
  filter(!is.na(type)) %>%
  group_by(type, rating) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = rating, y = count, fill = type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#FFC0CB", "#FFB6C1", "#FF69B4", "#FF1493", "#C71585", "#DB7093", "#FC94AF")) +
  labs(title = "Anime Type by Rating", x = "Rating", y = "Count") +
  theme_minimal()
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.

Visualization 4: Quantity of Anime per Type

# count the number of anime in the dataset for each type
anime_count <- anime %>%
  group_by(type) %>%
  summarize(count = n())

# create bar plot of anime counts by type
ggplot(anime_count, aes(x = type, y = count, fill = type)) +
  geom_col() +
  scale_fill_brewer(palette = "RdPu") +
  labs(title = "Number of Anime per Type",
       x = "Type",
       y = "Count",
       fill = "Type")

Visualization 5: Anime by Genre

# filter out rows with missing genre information
anime <- anime %>% filter(!is.na(genre))

# group anime titles by genre and count the number of titles in each genre
genre_counts <- anime %>% 
  mutate(genre = strsplit(as.character(genre), ", ")) %>% 
  unnest(genre) %>% 
  group_by(genre) %>% 
  summarize(num_titles = n())

# plot the number of anime titles in each genre
ggplot(genre_counts, aes(x = genre, y = num_titles)) + 
  geom_bar(stat = "identity", fill = "#db7093") + 
  labs(title = "Number of Anime Titles by Genre",
       x = "Genre",
       y = "Number of Titles") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major.y = element_line(color = "#ffc0cb"))
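
The split-and-count step used for this plot can also be sketched in base R, which makes it easy to check which genre comes out on top. The `toy_genres` vector below is made up for illustration and is not taken from anime.csv.

```r
# Toy genre strings in the same comma-separated format as the genre column
toy_genres <- c("Comedy, Romance", "Action, Comedy", "Comedy", "Action, Sci-Fi")

# Split each string into individual genres and flatten into one vector
split_genres <- unlist(strsplit(toy_genres, ", ", fixed = TRUE))

# Tabulate the genres and sort from most to least frequent
genre_counts <- sort(table(split_genres), decreasing = TRUE)

# Name of the most frequent genre
top_genre <- names(genre_counts)[1]
```

Because a title with several genres is counted once per genre, the counts add up to more than the number of titles.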

Visualization 6: Top 20 Anime

anime %>%
  filter(!is.na(members) & !is.na(rating)) %>%
  arrange(desc(members)) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(name, -members), y = rating, size = members, color = rating)) +
  geom_point(alpha = 0.8) +
  scale_color_gradient(low = "#FFC0CB", high = "#C71585") +
  labs(title = "Top 20 Anime by Members and Rating", x = "Anime Titles", y = "Average Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

According to NFI, anime is a style of animation that is produced in Japan or influenced by Japanese animation. It is the Japanese term for cartoon or animation. Outside Japan, however, anime denotes animation that comes exclusively from Japan, distinguished by blazing graphics, energetic characters, and attractive themes such as sci-fi, romance, and supernatural forces. https://www.nfi.edu/what-is-anime/

My visualizations mainly show how ratings and member counts relate to the shows themselves. The first two plots show the distribution of ratings; the second is a cleaned-up version of the first. The third plot is a scatter plot of rating against members. Here, we can see that the shows with more members tend to have higher scores, which agrees with the positive correlation of about 0.31 between rating and members. The next visualization shows the number of titles at each rating, colored by type. We can see that TV shows made up the most. The next visualization shows the number of anime in the data set by type. Again, the type with the most was TV. Next, I showed the number of anime titles by genre. In this graph we can see that the genre with the most titles was comedy and that several categories were close for the fewest. Finally, my last visualization shows the top 20 anime by number of members, with the rating on the vertical axis and the number of members represented by the size of the bubble. Here we can see that Death Note has the largest following but Fullmetal Alchemist: Brotherhood had the highest rating.