knitr::include_graphics("C:/Users/maddi/Downloads/chika.jpg")
The Anime Recommendations Dataset, available on Kaggle, is a collection of anime titles and their related information such as genre, rating, number of episodes, and number of members who have watched the show. It contains user preference data from 73,516 users on 12,294 anime. Each user can add anime to their completed list and give it a rating, and this data set is a compilation of those ratings.
The variables in this data set are anime_id, name, genre, type, episodes, rating, and members.
The source gives no details on how the data was collected or how the ratings were gathered, although at least some of it comes from MyAnimeList.net, a popular anime and manga community that allows users to rate and review anime titles.
I chose this topic because I love to watch anime and it is a popular form of entertainment for many. Anime has a significant cultural impact on many around the globe.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(RColorBrewer)
anime <- read_csv("anime.csv")
## Rows: 12294 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, genre, type, episodes
## dbl (3): anime_id, rating, members
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(anime)
## spc_tbl_ [12,294 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ anime_id: num [1:12294] 32281 5114 28977 9253 9969 ...
## $ name : chr [1:12294] "Kimi no Na wa." "Fullmetal Alchemist: Brotherhood" "Gintama°" "Steins;Gate" ...
## $ genre : chr [1:12294] "Drama, Romance, School, Supernatural" "Action, Adventure, Drama, Fantasy, Magic, Military, Shounen" "Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen" "Sci-Fi, Thriller" ...
## $ type : chr [1:12294] "Movie" "TV" "TV" "TV" ...
## $ episodes: chr [1:12294] "1" "64" "51" "24" ...
## $ rating : num [1:12294] 9.37 9.26 9.25 9.17 9.16 9.15 9.13 9.11 9.1 9.11 ...
## $ members : num [1:12294] 200630 793665 114262 673572 151266 ...
## - attr(*, "spec")=
## .. cols(
## .. anime_id = col_double(),
## .. name = col_character(),
## .. genre = col_character(),
## .. type = col_character(),
## .. episodes = col_character(),
## .. rating = col_double(),
## .. members = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(anime)
## anime_id name genre type
## Min. : 1 Length:12294 Length:12294 Length:12294
## 1st Qu.: 3484 Class :character Class :character Class :character
## Median :10260 Mode :character Mode :character Mode :character
## Mean :14058
## 3rd Qu.:24795
## Max. :34527
##
## episodes rating members
## Length:12294 Min. : 1.670 Min. : 5
## Class :character 1st Qu.: 5.880 1st Qu.: 225
## Mode :character Median : 6.570 Median : 1550
## Mean : 6.474 Mean : 18071
## 3rd Qu.: 7.180 3rd Qu.: 9437
## Max. :10.000 Max. :1013917
## NA's :230
head(anime)
## # A tibble: 6 × 7
## anime_id name genre type episodes rating members
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 32281 Kimi no Na wa. Dram… Movie 1 9.37 200630
## 2 5114 Fullmetal Alchemist: Brotherhood Acti… TV 64 9.26 793665
## 3 28977 Gintama° Acti… TV 51 9.25 114262
## 4 9253 Steins;Gate Sci-… TV 24 9.17 673572
## 5 9969 Gintama' Acti… TV 51 9.16 151266
## 6 32935 Haikyuu!!: Karasuno Koukou VS Sh… Come… TV 10 9.15 93351
# read in the anime.csv dataset
anime <- read.csv("anime.csv", stringsAsFactors = FALSE)
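# replace any NA values in the "episodes" column with 0
# (the column is still character here, so non-numeric entries are not affected by this step)
anime$episodes[is.na(anime$episodes)] <- 0
# convert the "episodes" column to numeric; entries that are not numbers become NA,
# which is what triggers the coercion warning below
anime$episodes <- as.numeric(anime$episodes)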
## Warning: NAs introduced by coercion
# replace all NA values in the "rating" column with 0
anime$rating[is.na(anime$rating)] <- 0
# convert the "rating" column to numeric
anime$rating <- as.numeric(anime$rating)
# replace all NA values in the "members" column with 0
anime$members[is.na(anime$members)] <- 0
# convert the "members" column to numeric
anime$members <- as.numeric(anime$members)
# replace all NA values in the "genre" column with "Unknown"
anime$genre[is.na(anime$genre)] <- "Unknown"
# remove any leading or trailing whitespaces in the "genre" column
anime$genre <- trimws(anime$genre)
# remove any duplicates in the dataset
anime <- distinct(anime)
# check the structure of the cleaned dataset
str(anime)
## 'data.frame': 12294 obs. of 7 variables:
## $ anime_id: int 32281 5114 28977 9253 9969 32935 11061 820 15335 15417 ...
## $ name : chr "Kimi no Na wa." "Fullmetal Alchemist: Brotherhood" "Gintama°" "Steins;Gate" ...
## $ genre : chr "Drama, Romance, School, Supernatural" "Action, Adventure, Drama, Fantasy, Magic, Military, Shounen" "Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen" "Sci-Fi, Thriller" ...
## $ type : chr "Movie" "TV" "TV" "TV" ...
## $ episodes: num 1 64 51 24 51 10 148 110 1 13 ...
## $ rating : num 9.37 9.26 9.25 9.17 9.16 9.15 9.13 9.11 9.1 9.11 ...
## $ members : num 200630 793665 114262 673572 151266 ...
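The coercion warning above suggests that some `episodes` entries are not numbers, and zero-filling missing ratings pulls averages toward 0. Below is a minimal base R sketch of an alternative on made-up toy data. It assumes the non-numeric placeholder is the string "Unknown", which is not confirmed by the output shown here; `toy` and `mean_rating` are illustrative names only.

```r
# Toy stand-in for a few rows of anime.csv (made-up values)
toy <- data.frame(
  name     = c("Show A", "Show B", "Show C"),
  episodes = c("12", "Unknown", "24"),   # "Unknown" is an assumed placeholder
  rating   = c(8.1, NA, 7.4),
  stringsAsFactors = FALSE
)

# Mark the non-numeric placeholder as NA first, then convert;
# as.numeric() then has nothing left to coerce and raises no warning
toy$episodes[toy$episodes == "Unknown"] <- NA
toy$episodes <- as.numeric(toy$episodes)

# Leave missing ratings as NA and skip them when summarising
mean_rating <- mean(toy$rating, na.rm = TRUE)
mean_rating
```

Keeping missing values as NA means later steps have to handle them explicitly (for example with `na.rm = TRUE`), but it avoids treating "not rated" as a rating of 0.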
# Make sure rating and members are numeric (both were already converted during cleaning)
anime$rating <- as.numeric(anime$rating)
anime$members <- as.numeric(anime$members)
# Calculate the correlation matrix. A correlation coefficient of 1 indicates a perfect
# positive correlation, 0 indicates no linear correlation, and -1 indicates a perfect
# negative correlation.
cor_matrix <- cor(anime %>% select(where(is.numeric)))
cor_matrix
## anime_id episodes rating members
## anime_id 1.00000000 NA -0.3559499 -0.08007118
## episodes NA 1 NA NA
## rating -0.35594989 NA 1.0000000 0.31142103
## members -0.08007118 NA 0.3114210 1.00000000
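The NA entries in the `episodes` row and column appear because `cor()` returns NA for any pair that involves a column with missing values. A minimal sketch on made-up toy data, using the `use` argument of base R's `cor()`, shows one way to get pairwise correlations anyway; the data frame `toy` is illustrative and not part of the original analysis.

```r
# Toy numeric data with one missing value (made-up values)
toy <- data.frame(
  episodes = c(12, NA, 24, 1, 50),
  rating   = c(7.5, 8.0, 6.9, 8.4, 7.1),
  members  = c(1000, 25000, 800, 90000, 4000)
)

# Default behaviour: pairs involving a column with NA come back as NA
cor_default <- cor(toy)
cor_default

# Pairwise deletion: each pair of columns uses only the rows where both are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
cor_pairwise
```

With pairwise deletion, different cells of the matrix can be based on different subsets of rows, which is worth keeping in mind when comparing coefficients.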
# create histogram of anime ratings
ggplot(anime, aes(x = rating)) +
geom_histogram(fill = "#DDA0DD", color = "#EE82EE") +
labs(title = "Distribution of Anime Ratings",
x = "Rating",
y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(anime, aes(x = rating)) +
geom_histogram(binwidth = 0.5, color = "#E6E6FA", fill = "#D8BFD8") +
scale_x_continuous(breaks = seq(0, 10, 0.5)) +
labs(title = "Distribution of Anime Ratings",
x = "Rating",
y = "Frequency",
fill = "") +
theme_minimal()
ggplot(anime, aes(x = rating, y = members)) +
geom_point(alpha = 0.6, color = "#DA70D6") +
scale_x_continuous(limits = c(0, 10)) +
labs(title = "Relationship Between Anime Ratings and Members",
x = "Rating",
y = "Members",
fill = "") +
theme_minimal()
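The `summary()` output shows that `members` is heavily right-skewed (median 1,550 versus a maximum above one million), so a few very popular titles can dominate a linear correlation. As a rough sketch on made-up toy data, a rank-based (Spearman) correlation can be compared with the Pearson value; the numbers and the `toy` data frame are hypothetical.

```r
# Toy ratings and member counts (made-up values); members is right-skewed
toy <- data.frame(
  rating  = c(5.9, 6.6, 7.2, 7.8, 8.5, 9.1),
  members = c(120, 90, 1500, 9400, 60000, 800000)
)

# Pearson measures linear association and is sensitive to extreme member counts
pearson <- cor(toy$rating, toy$members, method = "pearson")

# Spearman works on ranks, so it is unchanged by a log transform of members
spearman     <- cor(toy$rating, toy$members, method = "spearman")
spearman_log <- cor(toy$rating, log10(toy$members), method = "spearman")

c(pearson = pearson, spearman = spearman, spearman_log = spearman_log)
```

Because Spearman depends only on the ordering of values, it gives the same answer whether `members` is used raw or on a log scale.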
anime %>%
filter(!is.na(type)) %>%
group_by(type, rating) %>%
summarise(count = n()) %>%
ggplot(aes(x = rating, y = count, fill = type)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("#FFC0CB", "#FFB6C1", "#FF69B4", "#FF1493", "#C71585", "#DB7093", "#FC94AF")) +
labs(title = "Anime Type by Rating", x = "Rating", y = "Count") +
theme_minimal()
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.
# count the number of anime in the dataset for each type
anime_count <- anime %>%
group_by(type) %>%
summarize(count = n())
# create bar plot of anime counts by type
ggplot(anime_count, aes(x = type, y = count, fill = type)) +
geom_col() +
scale_fill_brewer(palette = "RdPu") +
labs(title = "Number of Anime per Type",
x = "Type",
y = "Count",
fill = "Type")
# filter out rows with missing genre information
anime <- anime %>% filter(!is.na(genre))
# group anime titles by genre and count the number of titles in each genre
genre_counts <- anime %>%
mutate(genre = strsplit(as.character(genre), ", ")) %>%
unnest(genre) %>%
group_by(genre) %>%
summarize(num_titles = n())
# plot the number of anime titles in each genre
ggplot(genre_counts, aes(x = genre, y = num_titles)) +
geom_bar(stat = "identity", fill = "#db7093") +
labs(title = "Number of Anime Titles by Genre",
x = "Genre",
y = "Number of Titles") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major.y = element_line(color = "#ffc0cb"))
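The genre counts come from splitting each comma-separated `genre` string into individual genres and tallying them. A minimal base R sketch of the same idea on three made-up strings, sorted by count so the most and least common genres are easy to read off; `toy_genre` and `genre_tab` are illustrative names.

```r
# Three made-up genre strings in the same comma-separated format
toy_genre <- c("Action, Comedy", "Comedy, Romance", "Comedy")

# Split on commas, flatten to one vector, and trim surrounding whitespace
genres <- trimws(unlist(strsplit(toy_genre, ",")))

# Tally each genre and sort from most to least frequent
genre_tab <- sort(table(genres), decreasing = TRUE)
genre_tab
```

A title with several genres is counted once per genre, so the totals across genres add up to more than the number of titles.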
anime %>%
filter(!is.na(members) & !is.na(rating)) %>%
arrange(desc(members)) %>%
slice(1:20) %>%
ggplot(aes(x = reorder(name, -members), y = rating, size = members, color = rating)) +
geom_point(alpha = 0.8) +
scale_color_gradient(low = "#FFC0CB", high = "#C71585") +
labs(title = "Top 20 Anime by Members and Rating", x = "Anime Titles", y = "Average Rating") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
According to NFI, anime is a style of animation that is produced in Japan or influenced by Japanese animation. It is the Japanese term for cartoon or animation. However, outside Japan, anime denotes animation that comes exclusively from Japan, distinguished by blazing graphics, energetic characters, and attractive themes such as sci-fi, romance, and supernatural forces. https://www.nfi.edu/what-is-anime/
My visualizations mainly explore how ratings and member counts relate to the shows themselves. The first two show the distribution of ratings; the second is a cleaned-up version of the first. The third is a scatter plot of rating against members, where we can see that shows with more members tend to have higher ratings. The next visualization shows the count of anime at each rating, colored by type; here TV shows make up the largest share. The following one shows the number of anime in the data set by type, and again TV is the largest. Next, I showed the number of anime titles by genre. In this graph, comedy has the most titles, and several genres are close together at the bottom with the fewest. Finally, my last visualization shows the 20 anime with the most members, plotted by rating, with the member count represented by the size of the bubble. Here we can see that Death Note has the largest following, but Fullmetal Alchemist: Brotherhood has the highest rating.