The word anime is the Japanese term for animation and, in Japan, covers all forms of animated media. Outside Japan, anime refers specifically to animation from Japan, or to a Japanese-disseminated animation style often characterized by colorful graphics, vibrant characters and fantastical themes. The anime industry consists of over 430 production studios, including major names like Studio Ghibli, Gainax, and Toei Animation. Despite comprising only a fraction of Japan’s domestic film market, anime makes up a majority of Japanese DVD sales. It has also seen international success after the rise of English-dubbed programming. This rise in international popularity has resulted in non-Japanese productions using the anime art style. Whether these works are anime-influenced animation or proper anime is a subject of debate amongst fans. As of 2016, Japanese anime accounted for 60% of the world’s animated television shows.
Problem Statement:
The objective of this project is to explore the Anime dataset and answer questions such as which factors affect the ratings, rankings and popularity of an anime.
Approach
I will clean the data before beginning the analysis. As part of the cleaning process, I will handle duplicates, outliers and missing values, drop the variables that look irrelevant, and change data types and values wherever required to make the data tidier. I will then analyze factors such as broadcast day, premiered season, genre and type to see how they relate to the popularity of an anime. I will also analyze the number of anime produced by genre, type, season and so on, and its trend over the last 100 years. This analysis can help producers get the maximum benefit out of new anime, be it in monetary terms or in terms of popularity, ratings and fame.
The following packages will be used in the analysis:
library(tidyverse)
library(ggplot2)
library(dplyr)
library(DT)
library(tidyr)
library(kableExtra)
library(lubridate)
library(highcharter)
The dataset used in the study can be found here –> Original dataset.
The first step is to import the dataset into RStudio.
tidy_anime <- read.csv("C:/Users/nihar/Dropbox/MSIS-2019 Material/Semster 1 Flex 2 (Fall-2019)/data Wrangling/Final Project/anime_data/tidy_anime.csv", stringsAsFactors = FALSE, header = TRUE)
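The absolute Windows path above only resolves on the author's machine. A minimal sketch of a more portable import follows, assuming the CSV is kept in a folder that can be addressed with `file.path()`. The temporary folder and the two-row stand-in file are hypothetical and exist only so the snippet runs on its own.

```r
# Stand-in for the real file: write a tiny CSV into a temporary folder.
# (The folder location and the toy columns are assumptions for illustration.)
data_dir <- file.path(tempdir(), "anime_data")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "tidy_anime.csv")
write.csv(
  data.frame(id = c(1, 2), name = c("Title A", "Title B")),
  csv_path,
  row.names = FALSE
)

# Build the path from its parts and fail early if the file is missing.
stopifnot(file.exists(csv_path))
toy_anime <- read.csv(csv_path, stringsAsFactors = FALSE, header = TRUE)
dim(toy_anime)
```

Building the path with `file.path()` avoids hard-coding a separator, and the `file.exists()` check gives a clear failure before `read.csv()` is called.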
The original dataset has the following variables:
Before I proceed with data cleaning, I need to understand the dataset: its structure, variable types, missing values, duplicate values and so on. I will do that in the following steps:
I will check the dimensions of the dataset using the code below.
Observations:
dim(tidy_anime)
I will look at the different columns associated with the dataset and then drop the ones I feel are not relevant for the study.
names(tidy_anime)
Observations:
anime_relevant <- select(tidy_anime,-c(3, 4 ,5, 24, 25, 28 ))
dim(anime_relevant) # Verifying the changes
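Dropping columns by position, as above, breaks silently if the column order of the source file ever changes. A small sketch of the same step done by name is shown here. The toy table and its column names are made up for illustration and are not the names in the real dataset.

```r
library(dplyr)

# Toy table; the column names are hypothetical.
toy <- data.frame(
  id = 1:3,
  name = c("A", "B", "C"),
  title_english = c("a", "b", "c"),
  synopsis = c("...", "...", "..."),
  score = c(7.1, 8.2, 6.5)
)

# Drop columns by name, so the code keeps working if the column order changes.
# all_of() raises an error if one of the listed names does not exist.
drop_cols <- c("title_english", "synopsis")
toy_relevant <- select(toy, -all_of(drop_cols))
names(toy_relevant)
```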
Observation:
names(anime_relevant)[1]<-"anime_ID"
names(anime_relevant)[9]<-"airing_status"
names(anime_relevant)[14]<-"age_group"
names(anime_relevant)[15]<-"anime_rating"
names(anime_relevant)[18]<-"popularity_score"
names(anime_relevant)[19]<-"member_count"
names(anime_relevant)[20]<-"favourite_count"
names(anime_relevant)[21]<-"season_premiered"
names(anime_relevant)[22]<-"broadcast_timing"
colnames(anime_relevant) # Verifying that the names are changed
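The renaming above also relies on column positions. As a hedged alternative, `dplyr::rename()` pairs each new name with an old one and fails loudly when an old name is missing. The old names used below (`animeID`, `status`, `score`) are assumptions, since the original column names are not listed in this report.

```r
library(dplyr)

# Toy table with assumed original column names.
toy <- data.frame(
  animeID = 1:2,
  status = c("Finished Airing", "Currently Airing"),
  score = c(8.8, 7.4)
)

# rename() takes new_name = old_name pairs.
toy_renamed <- rename(
  toy,
  anime_ID = animeID,
  airing_status = status,
  anime_rating = score
)
colnames(toy_renamed)
```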
I will check if all the variable types are appropriate and then change the variable type if required.
str(anime_relevant)
Observations:
anime_relevant$start_date <- as.Date(anime_relevant$start_date)
anime_relevant$end_date <- as.Date(anime_relevant$end_date)
anime_relevant$age_group <- as.factor(anime_relevant$age_group)
anime_relevant$type <- as.factor(anime_relevant$type)
anime_relevant$genre <- as.factor(anime_relevant$genre)
anime_relevant$airing_status <- as.factor(anime_relevant$airing_status)
str(anime_relevant) #Verifying that the data type is updated
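A short sketch of the same conversions on a toy table, assuming the dates are stored as `YYYY-MM-DD` strings (the format is an assumption, as the raw values are not shown here). Passing an explicit `format` to `as.Date()` turns entries that do not match into `NA` rather than stopping with an error.

```r
# Toy table; values are made up for illustration.
toy <- data.frame(
  start_date = c("1998-04-03", "2001-09-01", NA),
  type = c("TV", "Movie", "TV"),
  stringsAsFactors = FALSE
)

# An explicit format makes unparseable entries NA instead of raising an error.
toy$start_date <- as.Date(toy$start_date, format = "%Y-%m-%d")
toy$type <- as.factor(toy$type)

str(toy)
```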
The columns start_date, end_date, season_premiered and broadcast_timing can be split for ease of analysis.
Observations:
anime_separated <- anime_relevant %>%
separate(start_date, c("start_year", "start_month", "start_day")) %>%
separate(end_date, c("end_year", "end_month", "end_day")) %>%
separate(season_premiered, c("premiered_season", "premiered_Year")) %>%
separate(broadcast_timing, c("broadcast_day", "blank1", "broadcast_time", "blank2"), sep = " " ) %>%
select(-c(blank1, blank2))
colnames(anime_separated) # Verifying that the columns are separated
dim(anime_separated) #Verifying number of columns
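To make the splitting step concrete, here is a small sketch on two toy rows. The example strings (`"Spring 1998"`, `"Saturdays at 01:00 (JST)"`, `"Unknown"`) are assumed formats, not values copied from the dataset. `fill = "right"` and `extra = "drop"` tell `separate()` how to handle rows with too few or too many pieces, so no warnings are raised.

```r
library(tidyr)

# Toy rows with assumed value formats.
toy <- data.frame(
  season_premiered = c("Spring 1998", "Fall 2001"),
  broadcast_timing = c("Saturdays at 01:00 (JST)", "Unknown"),
  stringsAsFactors = FALSE
)

# Split "Spring 1998" into season and year.
step1 <- separate(toy, season_premiered,
                  c("premiered_season", "premiered_Year"), sep = " ")

# Split the broadcast string on spaces; short rows are padded with NA on the right.
step2 <- separate(step1, broadcast_timing,
                  c("broadcast_day", "blank1", "broadcast_time", "blank2"),
                  sep = " ", fill = "right", extra = "drop")

# Drop the filler columns ("at" and "(JST)").
toy_separated <- step2[, !(names(step2) %in% c("blank1", "blank2"))]
toy_separated
```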
Next I will check if there are any duplicate rows and remove them accordingly.
Observations:
anime_clean <- unique(anime_separated)
dim(anime_clean)
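Before removing duplicates it can help to count them. The sketch below uses a toy table (values made up) to show that `duplicated()` only flags rows that repeat an earlier row in every column, which is the same notion of duplicate that `unique()` removes.

```r
# Toy table: rows 1 and 2 are identical, rows 3 and 4 differ in genre.
toy <- data.frame(
  id = c(1, 1, 2, 2),
  genre = c("Action", "Action", "Drama", "Comedy")
)

# duplicated() flags rows that repeat an earlier row in every column.
n_dupes <- sum(duplicated(toy))
toy_clean <- unique(toy)

n_dupes         # number of fully duplicated rows
dim(toy_clean)  # rows and columns left after removing them
```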
I will check the number of missing values for each column and then decide how to deal with missing values if any.
Observations:
colSums(is.na(anime_clean))
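Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a toy table (values made up) that reports both:

```r
# Toy table with a few missing values.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  stringsAsFactors = FALSE
)

missing_count <- colSums(is.na(toy))                   # missing values per column
missing_pct <- round(100 * colMeans(is.na(toy)), 1)    # same, as a percentage of rows

missing_count
missing_pct
```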
Before I impute the missing values, I need to change the data type for ‘start_month’ and ‘start_year’ to numeric. After changing the variable type, I will proceed with the imputation of missing values.
Observations:
anime_clean$start_month <- as.integer(anime_clean$start_month) #changing data type for start month
anime_clean$start_year <- as.integer(anime_clean$start_year) #changing data type for start year
anime_clean$premiered_season <- ifelse(anime_clean$start_month %in% c(3,4,5), "Spring",
ifelse(anime_clean$start_month %in% c(6,7,8), "Summer",
ifelse(anime_clean$start_month %in% c(9,10,11), "Fall",
ifelse(anime_clean$start_month %in% c(12,1,2), "Winter",
no = NA
)))) # Deriving the premiered season from the start month
anime_clean$premiered_Year <- anime_clean$start_year # Taking the premiered year from the start year
colSums(is.na(anime_clean)) # Verifying the changes
str(anime_clean)
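The nested `ifelse()` above recomputes the season for every row, including rows where a season was already recorded. If the intent is only to fill the gaps, one possible sketch is to derive a season from the month and use `coalesce()` so that recorded values win. The toy table is made up, and the month-to-season mapping is the same one used above.

```r
library(dplyr)

# Toy table: some seasons recorded, some missing.
toy <- data.frame(
  start_month = c(4L, 7L, 10L, 1L, NA),
  premiered_season = c("Spring", NA, NA, "Winter", NA),
  stringsAsFactors = FALSE
)

toy_imputed <- toy %>%
  mutate(
    # Season implied by the start month (NA when the month is unknown).
    season_from_month = case_when(
      start_month %in% c(3, 4, 5) ~ "Spring",
      start_month %in% c(6, 7, 8) ~ "Summer",
      start_month %in% c(9, 10, 11) ~ "Fall",
      start_month %in% c(12, 1, 2) ~ "Winter",
      TRUE ~ NA_character_
    ),
    # coalesce() keeps the recorded season and falls back to the derived one.
    premiered_season = coalesce(premiered_season, season_from_month)
  )

toy_imputed$premiered_season
```

Rows with neither a recorded season nor a start month stay `NA`, which keeps the remaining gaps visible in the later missing-value check.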
Since start date and premiered season indicate the same meaning, I will drop one of these variables. I will remove start day, month and year and will keep premiered season and year. I will also drop the column ‘airing’ as it is the same as airing_status.
Observations:
anime_neat <- select(anime_clean,-c( 10, 11, 12, 13))
colnames(anime_neat)
dim(anime_neat)
Next I will check the unique values in each column to determine if there are any values that need to be changed for more uniformity in the data.
Observations:
unique(anime_neat$anime_ID)
unique(anime_neat$name)
unique(anime_neat$type)
unique(anime_neat$genre)
unique(anime_neat$source)
unique(anime_neat$producers)
unique(anime_neat$episodes)
unique(anime_neat$airing_status)
unique(anime_neat$end_year)
unique(anime_neat$end_month)
unique(anime_neat$end_day)
unique(anime_neat$duration)
unique(anime_neat$anime_rating)
unique(anime_neat$premiered_season)
unique(anime_neat$premiered_Year)
unique(anime_neat$broadcast_day)
unique(anime_neat$broadcast_time)
anime_neat$broadcast_day[anime_neat$broadcast_day =="Unknown" |anime_neat$broadcast_day =="Not"] <- NA
anime_neat$type[anime_neat$type == "Unknown"] <- NA
unique_id <- unique(anime_neat$anime_ID)
length(unique_id)
#Filtering data for only unique ID values
anime_final <- anime_neat %>%
distinct(anime_ID, .keep_all = TRUE)
dim(anime_final)
I will use box plots to check whether there are any outliers.
Observation:
boxplot(anime_final$rank)
boxplot(anime_final$episodes)
boxplot(anime_final$anime_rating)
boxplot(anime_final$scored_by)
boxplot(anime_final$popularity_score)
boxplot(anime_final$member_count)
boxplot(anime_final$favourite_count)
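Box plots show outliers visually. To list them as numbers, `boxplot.stats()` applies the same rule as `boxplot()` (points more than 1.5 times the hinge spread beyond the hinges). A sketch on a made-up vector of episode counts:

```r
# Made-up episode counts with two unusually long series.
episodes <- c(1, 12, 12, 13, 24, 25, 26, 50, 1000)

stats <- boxplot.stats(episodes)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers
```

For this vector the hinges are 12 and 26, so anything above 26 + 1.5 * 14 = 47 is flagged, which is why both 50 and 1000 appear in `stats$out`.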
The top 100 observations of the final cleaned dataset can be found below in an interactive table.
output_data <- head(anime_final, n=100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))
The final dataset that I will be using in the study has the following variables:
The aim of my study is to analyze the popularity of anime. I will measure popularity using ratings, rank and popularity score. I will not consider scored_by, since the rating is already the mean of all the scores submitted, and I will not use member count, as the popularity score already reflects it. I will also leave out favorite count.
I will plot these measures against my variables of interest, which are genre, type, premiered season and broadcast time. I will also analyze the number of observations (frequencies) for each of these variables.
I will analyze the rating, ranking, frequency and popularity score by genre to see how they depend on the genre. For the genre analysis I will not use the dataset from which the duplicate IDs were removed (anime_final); instead I will use the clean dataset that still contains them (anime_neat). The reason is that an anime with several genres has one row per genre under the same ID, so keeping only one row per ID would keep only one of its genres and bias the genre analysis.
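The effect described above can be seen on a toy table with one row per anime and genre (IDs and genres made up). Counting on the long table credits every genre of a title, while de-duplicating by ID first keeps only the first genre listed for each title.

```r
library(dplyr)

# Toy long table: one row per (anime, genre) pair.
toy <- data.frame(
  anime_ID = c(1, 1, 2, 3, 3),
  genre = c("Action", "Sci-Fi", "Action", "Drama", "Action"),
  stringsAsFactors = FALSE
)

# Count on the long table: every genre of every title is counted.
by_genre_long <- toy %>%
  count(genre, name = "num_animes")

# De-duplicate by ID first: only the first genre of each title survives.
by_genre_dedup <- toy %>%
  distinct(anime_ID, .keep_all = TRUE) %>%
  count(genre, name = "num_animes")

by_genre_long
by_genre_dedup
```

In the de-duplicated version "Sci-Fi" disappears entirely and "Action" drops from three titles to two.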
Observation:
anime_neat %>%
filter(!(is.na(genre))) %>%
group_by(genre) %>%
summarize(num_animes = n_distinct(anime_ID)) %>% # count distinct titles per genre instead of summing IDs
ggplot(aes(x = reorder(genre, num_animes), y = num_animes, fill = genre)) +
geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
scale_fill_brewer(palette = "YlGnBu") +
geom_text(aes(label = num_animes), hjust = 2.0, color = "black", size = 1.5) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
ggtitle("Number of Animes per Genre") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.ticks = element_blank())
anime_neat %>%
filter(!(is.na(genre))) %>%
group_by(genre) %>%
summarise(num_animes = n_distinct(anime_ID)) %>% # count distinct titles per genre instead of summing IDs
top_n(10, num_animes) %>%
ggplot(aes(x = reorder(genre, num_animes), y = num_animes, fill =genre)) +
geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
scale_fill_brewer(palette = "YlGnBu") +
theme_minimal() +
coord_flip() +
labs(x="Genre", y="Number of Animes") +
ggtitle("Genre and Number of Animes", subtitle = "Number of Animes vs Genre") +
geom_text(aes(label=round( num_animes,digit =2)), hjust=2.0, color = "black", size= 3.5) +
xlab("Genre") +
ylab("Number of Animes") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "black", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Observation:
anime_neat %>%
filter(!(is.na(genre))) %>%
group_by(genre) %>%
summarise(mean_user_rating = mean(anime_rating, na.rm = TRUE)) %>%
top_n(10, mean_user_rating) %>%
ggplot(aes(x = reorder(genre, mean_user_rating), y = mean_user_rating, fill = genre)) +
geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
scale_fill_brewer(palette = "YlGnBu") +
theme_minimal() +
coord_flip() +
labs(x="Genre", y="Average Rating") +
geom_text(aes(label=round(mean_user_rating,digit =2)), hjust=2.0, color = "black", size= 3.5) +
ggtitle("Genre and Average Ratings", subtitle = "Viewer's Rating vs Genre") +
xlab("Genre") +
ylab("Average User Ratings") +
ylim(0,10) +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "black", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Observation:
anime_neat %>%
filter(!(is.na(genre))) %>%
group_by(genre) %>%
summarize(rankbygenre = mean(rank, na.rm=TRUE)) %>%
top_n(10, rankbygenre) %>%
ggplot(aes(x = reorder(genre, rankbygenre), y = rankbygenre, fill=genre)) +
geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
scale_fill_brewer(palette = "YlGnBu") +
geom_text(aes(label=round(rankbygenre, digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
theme_minimal() +
coord_flip() +
labs(x="Genre", y="Mean ranking") +
ggtitle("Rankings By genre") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.ticks = element_blank())
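One caveat when reading the ranking chart: `top_n()` keeps the rows with the largest values. If rank 1 is the best rank, as is assumed here to be the MyAnimeList convention, the ten largest mean ranks are the worst-ranked genres, not the best. The sketch below contrasts the two selections on made-up numbers. `slice_min()` requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up mean ranks per genre; a smaller rank is assumed to be better.
toy <- data.frame(
  genre = c("A", "B", "C", "D"),
  mean_rank = c(1200, 4500, 800, 9000),
  stringsAsFactors = FALSE
)

largest <- top_n(toy, 2, mean_rank)          # two LARGEST mean ranks
smallest <- slice_min(toy, mean_rank, n = 2) # two SMALLEST mean ranks

largest$genre
smallest$genre
```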
Observation:
anime_neat %>%
filter(!(is.na(genre))) %>%
group_by(genre) %>%
summarize(popularity_genre = mean(popularity_score, na.rm=TRUE)) %>%
top_n(10, popularity_genre) %>%
ggplot(aes(x = reorder(genre, +popularity_genre), y = popularity_genre, fill=genre)) +
geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
scale_fill_brewer(palette = "YlGnBu") +
geom_text(aes(label=round(popularity_genre,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
theme_minimal() +
coord_flip() +
labs(x="Genre", y="Popularity Score") +
ggtitle("Popularity By genre") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.ticks = element_blank())
Observation:
anime_final %>%
filter(!(is.na(type))) %>%
filter(!(type == "Unknown")) %>%
group_by(type) %>%
summarise(number_animes = n()) %>% # anime_final has one row per anime, so n() is the count
ggplot(aes(x = reorder(type, number_animes), y =number_animes, fill = type)) +
geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
scale_fill_brewer(palette = "YlGnBu") +
coord_flip() +
theme_minimal() +
labs(x="Anime Type", y="Count") +
geom_text(aes(label=round(number_animes,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
ggtitle("Count by Anime Type", subtitle = "Count vs Anime Type") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Observation:
anime_final %>%
filter(!(is.na(type))) %>%
filter(!(type == "Unknown")) %>%
group_by(type) %>%
summarise(mean_user_rating = mean(anime_rating, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(type, mean_user_rating), y = mean_user_rating, fill = type)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "YlGnBu") +
theme_minimal() +
labs(x="Anime Type", y="Average Rating") +
geom_text(aes(label=round(mean_user_rating,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
ggtitle("Ratings by Anime Type", subtitle = "Average Rating vs Anime Type") +
ylim(0,10) +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Observation:
anime_final %>%
filter(!(is.na(type))) %>%
filter(!(type == "Unknown")) %>%
group_by(type) %>%
summarise(rank_anime = mean(rank, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(type, rank_anime), y =rank_anime, fill = type)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "YlGnBu") +
theme_minimal() +
labs(x="Anime Type", y="Rank") +
geom_text(aes(label=round(rank_anime,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
ggtitle("Rank by Anime Type", subtitle = "Rank vs Anime Type") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Observation:
anime_final %>%
filter(!(is.na(type))) %>%
filter(!(type == "Unknown")) %>%
group_by(type) %>%
summarise(popularity_anime = mean(popularity_score, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(type, popularity_anime), y =popularity_anime, fill = type)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "YlGnBu") +
theme_minimal() +
labs(x="Anime Type", y="Popularity Score") +
geom_text(aes(label=round(popularity_anime,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
ggtitle("Popularity Score by Anime Type", subtitle = "Popularity Score vs Anime Type") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Observation:
To analyze the trend of anime over the years, I have dropped the observations for the year 2019. I do not have the complete data for this year and this can lead to biased results.
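A minimal sketch of that filtering step, on a toy table with made-up years. It assumes the year is held in a numeric `premiered_Year` column, as in the cleaned data above.

```r
library(dplyr)

# Toy table: one row per anime with its premiere year.
toy <- data.frame(
  anime_ID = 1:6,
  premiered_Year = c(2017, 2017, 2018, 2018, 2018, 2019)
)

# Drop the incomplete year 2019, then count titles per year.
per_year <- toy %>%
  filter(!is.na(premiered_Year), premiered_Year < 2019) %>%
  count(premiered_Year, name = "num_animes")

per_year
```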
Observation:
Summary of the problem
The goal was to analyze the Japanese anime dataset and find insights into the factors that affect the ratings, rankings and popularity of an anime over the last 100 years.
Data Used and Methodology
The data was collected from an online data source, which can be found at this –> Location.
This data comes from Tam Nguyen and MyAnimeList.net via Kaggle. According to Wikipedia - “MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging application website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga. The site claims to have 4.4 million anime and 775,000 manga entries. In 2015, the site received 120 million visitors a month.”
Methodology
Factors such as broadcast time, premiered season, genre and type were analyzed to understand the popularity of an anime. Variables that looked irrelevant were dropped as part of cleaning. During the course of the analysis, I also looked at the number of anime produced by genre, type, season and so on. This analysis can help producers get the maximum benefit out of new anime, be it in monetary terms or in terms of popularity, ratings and fame.
Interesting insights from Analysis
The above analysis helped me understand the anime dataset better. The following insights were drawn from it:
Implications to the consumer from analysis
This information can be helpful to several groups, such as film producers, critics and viewers. By analyzing ranking and popularity by broadcast day, producers can decide which days are best for broadcasting their work. Similarly, by analyzing the premiered season, producers can decide when to premiere, and critics can see at which time of the year the most or the fewest anime are released. Viewers and producers can also see which categories of anime are most popular among viewers.
Limitations in Analysis and future scope of Improvement
The analysis is descriptive only. A regression analysis could be performed on this dataset to derive a concrete formula for estimating viewership, popularity and rankings.
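As a rough illustration of what such a regression could look like, the sketch below fits a linear model with `lm()`. The data are synthetic and generated inside the snippet, so the coefficients say nothing about real anime. The choice of `type` and `episodes` as predictors is an assumption for illustration only.

```r
set.seed(1)

# Synthetic data, generated only so that the example runs on its own.
toy <- data.frame(
  type = factor(sample(c("TV", "Movie", "OVA"), 60, replace = TRUE)),
  episodes = sample(1:50, 60, replace = TRUE)
)
toy$anime_rating <- 6 + 0.02 * toy$episodes + rnorm(60, sd = 0.5)

# Linear model: rating explained by type and number of episodes.
fit <- lm(anime_rating ~ type + episodes, data = toy)
summary(fit)$coefficients
```

On the real dataset the same call would use the cleaned table in place of `toy`, and the fit would need the usual checks (residual plots, missing values, skewed predictors) before any coefficient is interpreted.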