The_World_of_Anime.utf8

1. Synopsis

MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging application website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga.

Anime without rankings or popularity scores were excluded. Producers, genre, and studio were converted from lists to tidy observations, so there will be repetitions of shows with multiple producers, genres, etc.

Problem Statement

This development has been carried out to analyse the various factors that influence the popularity or rank of a particular anime.

Implementation

The data was cleaned and shaped accordingly to carry out the analysis and infer the results.

2. Packages Required

library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(lubridate)
library(readxl)
library(highcharter)
library(lubridate)
library(scales)
library(RColorBrewer)
library(wesanderson)
library(plotly)
library(shiny)
library(readxl)

Package	Description
library(tidyr)	For changing the layout of your data sets, to convert data into the tidy format
library(DT)	For HTML display of data
library(ggplot2)	For customizable graphical representation
library(dplyr)	For data manipulation
library(tidyverse)	Collection of R packages designed for data science that works harmoniously with other packages
library(kableExtra)	To display table in a fancy way
library(lubridate)	Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not
library(readxl)	The readxl package makes it easy to get data out of Excel and into R
library(highcharter)	Highcharter is a R wrapper for Highcharts javascript libray and its modules
library(scales)	The idea of the scales package is to implement scales in a way that is graphics system agnostic
library(RColorBrewer)	RColorBrewer is an R package that allows users to create colourful graphs with pre-made color palettes that visualize data in a clear and distinguishable manner
library(wesanderson)	A Wes Anderson is color palette for R
library(plotly)	Plotly’s R graphing library makes interactive, publication-quality graphs
library(shiny)	Shiny is an R package that makes it easy to build interactive web apps straight from R

3. Data Preparation

3.1 Data Source

The data used in the analysis can be found here

3.2 Original Dataset

The original Dataset that has been used for this project can be found here

3.3 Data Cleaning

The column Premiered has both the season and the year combined. We are splitting this column into two columns, Premiered Season and Premiered Year as now both the columns will have information about one entity and the analysis can be done based on both season as well as year.

For a similar reason, we are splitting the Broadcast column into Day_of_week and Time to help in our analysis.

anime_clean <- tidy_anime %>% 
  separate(premiered, c("Premiered_Season", "Premiered_Year")) %>% 
    separate(broadcast, c("Day_of_Week", "Not_Needed1", "Time", "Not_Needed_2"), sep = " " ) %>% 
      select(-c(Not_Needed1,Not_Needed_2))

A lot of columns does not provide any useful information for us in our analysis. Thus, going forward, its better to filter out those columns and restrict our analysis to the columns of our interest or the columns which provide valuable insights from the given data.

For this, we are removing the following columns from our dataset:

Title - English
Title - Japanese
Title - Synonyms
Background
Synopsis
Related
Status
End Date

anime_final <- select(anime_clean, -c(title_english, title_japanese, title_synonyms, background, synopsis, related,status,end_date))

After removing the unneccesary columns, we rename all the column names with appropiate names using the snake_case

names(anime_final) <- c("anime_id", "anime_name", "anime_type", "source", "producers", "genre", "studio", "no_of_episodes", "airing_status", "start_date", "episode_duration", "MPAA_rating", "viewers_rating", "rated_by_no_of_viewers", "rankings", "popularity_index", "wishlisted_members", "favorites", "premiered_season", "premiered_year", "broadcast_day", "broadcast_time")

Now, we try to replace the missing values in premiered_season. For this, we extract the month value from start_date, and categorize it with the 4 seasons respectively. Wherever it is not possible to replace because of insufficient data, we replace them with NA.

anime_final$premiered_season <- ifelse(as.numeric(format.Date(anime_final$start_date, "%m")) %in% c(3,4,5), "Spring",
                                ifelse(as.numeric(format.Date(anime_final$start_date, "%m")) %in% c(6,7,8), "Summer",
                                ifelse(as.numeric(format.Date(anime_final$start_date, "%m")) %in% c(9,10,11), "Fall",
                                ifelse(as.numeric(format.Date(anime_final$start_date, "%m")) %in% c(12,1,2), "Winter",
                                no = NA ))))

After checking the summary of the data, we observe that we need to encode the Unknown values of Anime Type to NA

anime_final$anime_type[anime_final$anime_type == "Unknown"] <- NA

For the broadcast day column, we need to encode (Other) and Unknown to NA.

anime_final$broadcast_day[anime_final$broadcast_day == "Not"] <- NA
anime_final$broadcast_day[anime_final$broadcast_day == "Unknown"] <- NA

It would be better for our analysis to make the following variables as categorical variables instead of character variables: * Type * Genre * Rating * Premiered Season * Day of Week

anime_final %>% mutate_at(.vars = c("anime_type", "genre", "MPAA_rating", "premiered_season", "broadcast_day"), .funs = as.factor)

The column Start_Date is a character variable. Converting them to Date variables would help in further analysis.

anime_final$start_date <- as.Date(anime_final$start_date)
anime_final$premiered_year <- as.numeric(anime_final$premiered_year)

3.4 Cleaned Dataset

The final cleaned dataset can be found below in an interactive table.

datatable(anime_final, filter = 'top')

3.5 Summary of Variables

#Reading the variable summary excel File
var_sum <- read_excel("D:/College/Github/Projects/Anime_Analysis/variable_summary.xlsx")

kable(var_sum) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T, )

Variable	Class	Description
anime_id	int	anime ID (as in https://myanimelist.net/anime/animeID)
anime_name	character	anime title - extracted from the site.
anime_type	factor	anime type (e.g. TV, Movie, OVA)
source	character	source of anime (i.e original, manga, game, music, visual novel etc.)
producers	character	producers
genre	factor	genre
studio	character	studio
no_of_episodes	int	number of episodes
airing_status	character	True/False- is still airing?
start_date	date	start date
episode_duration	character	per episode duration or entire duration, text string
MPAA_rating	factor	age rating
viwers_rating	num	score (higher = better)
rated_by_no_of_viewers	int	number of users that scored
rankings	int	rank - weight according to MyAnimeList formula
popularity_index	int	based on how many members/users have the respective anime in their list
wishlisted_members	int	number members that added this anime in their list
favorites	int	number members that favorites these in their list
premiered_season	factor	the season the show first premiered
premiered_year	num	the year the show first premiered
broadcast_day	factor	the day of the week it gets broadcasted
broadcast_time	character	the time of the day it gets broadcasted

4. Exploratory Data Analysis

On the off chance that we intently watch the Anime dataset, we can see that every anime has different passages dependent on the quantity of sorts into which it tends to be characterized. Along these lines, for our examination it would be useful on the off chance that we can make a dataset with a solitary section for every anime.

unique_anime <- anime_final %>% distinct(anime_id, .keep_all = TRUE)

Anime Trend

data_line_year <- unique_anime %>% 
  filter(premiered_year != 2019) %>% 
  filter(!is.na(premiered_year)) %>% 
  group_by("Year" = premiered_year) %>% 
  summarise(Freq = n())

highchart() %>% 
  hc_add_series(data_line_year,
                hcaes(x = Year,
                      y = Freq,
                      color = Freq),
                type = "line") %>% 
  hc_title(text = "Number of animes per year") %>% 
  hc_xAxis(title = list(text = "Year")) %>% 
  hc_yAxis(title = list(text = "Number of Anime")) %>% 
  hc_tooltip(useHTML = T,
             borderWidth = 1.5,
             headerFormat = "",
             pointFormat = paste("{point.Year}
                                   
                                   {point.Freq} animes")) %>% 
  hc_legend(enabled = F)

In the graph above, we display the number of animes per year starting from 1961 to 2018. The treand we see, is that the number of animes kept on increasing with every year passing. We also observe that there is a significant increase in the anime count from 2005 to 2006 and sudden drop in count from 2006 to 2007. We observe another significant drop during the tenure of 2009-2010. Post that we see, that the anime count per year has kept on increasing till 2018. Over the years, we can conclude that, collectively there has been a growth in the anime culture as it has increased from 1 anime from 1961 to 254 anime in 2018.

Anime Type

Manga and anime are popular for many people around the world and has been one of Japan’s most lucrative businesses. Over the years the popularity of anime has increased significantly across the globe. This trend can be clearly analysed by the line graph below:

unique_anime %>% 
  filter(!(is.na(anime_type))) %>% 
  filter(!(anime_type == "Unknown")) %>% 
  group_by(anime_type) %>% 
  summarise(mean_user_rating = mean(viewers_rating, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(anime_type, mean_user_rating), y = mean_user_rating, fill = anime_type)) +
  geom_col(stat ="identity", color = "black", fill="#641E16") +
  coord_flip() +
  theme_gray() +
  labs(x = "Anime Type", y = "Viewer's Rating") +
  geom_text(aes(label = round(mean_user_rating,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Top 20 Popular Anime Type", subtitle = "Viewer's Rating vs Anime Type") + 
  xlab("Anime Type") + 
  ylab("Mean User Ratings") +
  ylim(0,10) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

From the above graph, we can conclude that the most famous anime-type is TV followed by Special and OVA. Television has a mean viewer rating of 6.75 which is the most noteworthy among all other anime types.

Anime Genre

anime_final %>% 
  group_by(genre) %>% 
  summarise( mean_rating = mean(viewers_rating, na.rm = TRUE)) %>% 
  top_n(10, wt = mean_rating) %>% 
  ggplot(aes(x = reorder(genre, mean_rating), y = mean_rating, fill = genre)) + 
  geom_col(stat = "identity", color = "black", fill = "#1F618D") +
  scale_fill_brewer(palette = "BrBG") +
  coord_flip() +
  theme_grey() +
  geom_text(aes(label = round(mean_rating,digit = 2)), hjust = 2.0, color = "#D4E6F1", size = 3.5) +
  ggtitle("Top 10 Popular Anime Genre", subtitle = "Genre vs Mean User Rating") + 
  xlab("Anime Genre") + 
  ylab("Mean User Ratings") +
  ylim(0,10) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

The above chart gives us the top 10 anime classifications. The best 10 all are appraised over 7 with Thriller being the most elevated among them all. This demonstrates that the thriller anime genre is generally preferred by the viewers when contrasted with all other anime kinds.

 plot_ly(
  type = 'scatter',
  x = anime_final$genre,
  y = anime_final$viewers_rating,
  mode = 'markers',
  transforms = list(
    list(
      type = 'aggregate',
      groups = anime_final$genre,
      aggregations = list(
        list(
          target = 'y', func = 'count', enabled = T
        )
      )
    )
  )
)

From the above graph, we can conclude that Comedy Genre has the maximum number of viewer ratings and the Sports Genre has the least number of viewer ratings. Thus, they can be classified as the most popular and the least popular genres respectively.

Now lets figure out the most popular genres in the anime industry. We can conclude this by counting the number of anime in each genre and representing them via graph.

anime_final %>%
  filter(!(is.na(genre))) %>% 
  group_by(genre) %>%
  summarize(total = sum(anime_id)) %>%
  ggplot(aes(x = reorder(genre, +total), y = total)) + 
  geom_bar(stat = "identity", fill = "#5B2C6F", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) + 
  coord_flip() + 
  scale_y_continuous(expand = c(0, 100000)) + 
  ggtitle("Number of Animes per Genre") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks = element_blank())

From the graph above we can easily conclude that the Comedy Genre is the most popular amongst all the anime genres followed by Action and Fantasy. Amongst the least popular genres in anime industry are Shounen AI, Shoujo Ai and Cars.

Broadcasting Days

Now we will be analysing the Popularity Index, Number of users who rated and the mean user rating against the Brodcasting Day, to figure out which day is most appropiate for broadcasting the show to gain the most attention from the viewers.

anime_final %>% 
  filter(!(is.na(broadcast_day))) %>% 
  filter(!(broadcast_day == "Unknown")) %>% 
  filter(!(broadcast_day == "Not")) %>% 
  group_by(broadcast_day) %>% 
  summarise( mean_popularity_index = mean(popularity_index, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(broadcast_day, mean_popularity_index), y = mean_popularity_index, fill = broadcast_day)) +
  geom_col(stat = "identity", color= "black", fill = "#D35400") +
  coord_flip() +
  geom_text(aes(label = round(mean_popularity_index,digit = 2)), hjust = 2.0, color = "#F6DDCC", size = 3.5) +
  theme_grey() +
  ggtitle("Popular Broadcasting Days", subtitle = "Broadcast Day vs Mean Popularity Index") + 
  xlab("Broadcast Day") + 
  ylab("Mean Popularity Index") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Here, we are analysing the Broadcasting Day vs Mean Popularity Index, to which we see that the shows on Monday are the most popular with a significant high mean popularity index when compared to the shows on weekends.

anime_final %>% 
  filter(!(is.na(broadcast_day))) %>% 
  filter(!(broadcast_day == "Unknown")) %>% 
  filter(!(broadcast_day == "Not")) %>% 
  group_by(broadcast_day) %>% 
  summarise(show_count = n()) %>% 
  ggplot(aes(x = reorder(broadcast_day, show_count), y = show_count, fill = broadcast_day)) +
  geom_col(stat = "identity", color = "black", fill = "#F1C40F") +
  coord_flip() +
  geom_text(aes(label = show_count), hjust = 2.0, color = "#17202A", size = 3.5) +
  theme_grey() +
  ggtitle("Popular Broadcasting Days", subtitle = "Broadcast Day vs No of Shows") + 
  xlab("Broadcast Day") + 
  ylab("Number of Shows") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Over here, we observe that the producers broadcast more number of shows on weekends as compared to weekdays. The most number of shows are broadcasted on Fridays, followed by Sundays and Saturdays. The Least number of shows are produced on Wednesdays

unique_anime %>% 
  filter(!(is.na(broadcast_day))) %>% 
  filter(!(broadcast_day == "Unknown")) %>% 
  filter(!(broadcast_day == "Not")) %>% 
  group_by(broadcast_day) %>% 
  summarise( mean_rating = mean(viewers_rating, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(broadcast_day, mean_rating), y = mean_rating, fill = broadcast_day)) +
  geom_col(stat = "identity", color ="black", fill = "#B03A2E") +
  coord_flip() +
  scale_fill_brewer(palette = "OrRd") +
  theme_minimal() +
  geom_text(aes(label = round(mean_rating, digit = 2)), hjust = 2, color = "#FDFEFE", size = 3.5) +
  ggtitle("Popular Broadcasting Days", subtitle = "Broadcasting Day vs Mean User Ratings") + 
  xlab("Broadcast Day") + 
  ylab("Mean User Ratings") +
  ylim(0,10) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

The above chart, shows the mean user ratings for the shows which are broadcasted on different days of the week. From this we can conclude, the most popular and top rated shows are generally aired on the weekends - Friday, Saturday and Sundays.

Premiered Season

Here we try to figure out which season is the best season or the most popular season amongst the viewers for watching anime.

unique_anime %>% 
  filter(!(is.na(premiered_season))) %>% 
  filter(!(premiered_season == "NA")) %>%
  group_by(premiered_season) %>% 
  summarise( mean_rating = mean(popularity_index, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(premiered_season, mean_rating), y = mean_rating, fill = premiered_season)) +
  geom_col(stat = "identity", color = "black", fill = "#515A5A") +
  coord_flip() +
  scale_fill_brewer(palette = "Greys") +
  theme_minimal() +
  geom_text(aes(label = round(mean_rating, digit = 2)), hjust = 2, color = "#EAECEE", size = 3.5) +
  ggtitle("Premiered Season Comparison", subtitle = "Premiered Season vs Popularity Index" ) + 
  xlab("Premiered Season") + 
  ylab("Mean Popularity Index") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(), 
        axis.ticks = element_blank())

The mean popularity index of the shows on winters is approx 1300+ points higher than the summer season. For this, we can conclude that viewers mostly watch anime during winters holidays followed by summer holidays. The popularity of shows in spring and fall is relatively less.

unique_anime %>% 
  filter(!(is.na(premiered_season))) %>% 
  filter(!(premiered_season == "NA")) %>%
  group_by(premiered_season) %>% 
  summarise( mean_rating = mean(viewers_rating, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(premiered_season, mean_rating), y = mean_rating, fill = premiered_season)) +
  geom_col(stat = "identity", color = "black", fill = "#FFD700") +
  coord_flip() +
  scale_fill_brewer(palette = "Oranges") +
  theme_grey() +
  geom_text(aes(label = round(mean_rating, digit = 2)), hjust = 2, color = "black", size = 3.5) +
  ggtitle("Premiered Season Comparison", subtitle = "Premiered Season vs Mean User Rating" ) + 
  xlab("Premiered Season") + 
  ylab("Mean User Ratings") +
  ylim(0,10) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(), 
        axis.ticks = element_blank())

From the above grapgh we can conclude that the animes in Fall have the highest rating but the other seasons are not that different as well. Thus, we can say that good anime shows are released throughout the year.

Anime Source

Lets now look at the sources of anime which are the most popular.

From the graph we can conclude that a lot of shows are original works followed by Manga and Game.

unique_anime %>%
  filter(source != 'Unknown') %>% 
  count(source, sort = TRUE) %>%
  ggplot(aes(x = reorder(source, +n), y = n)) +
  geom_bar(stat = "identity", fill = "red", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) + 
  coord_flip() + 
  geom_text(aes(label = n), hjust = 2, color = "darkblue", size = 3.5) +
  scale_y_continuous(expand = c(0, 0)) + 
  ggtitle("Number of Animes per Source") +
  xlab("Anime Source")+
  ylab("Count") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Producers

unique_anime %>%
  filter(!is.na(producers)) %>%
  group_by(producers) %>%
  summarise(total_anime = n()) %>%
  top_n(16, wt = producers) %>% 
  ggplot(aes(x = reorder(producers, +total_anime), y = total_anime)) +
  geom_bar(stat = "identity",fill = "black", colour = "yellow", width = 0.9, position = position_dodge(width = 0.9)) +
  coord_flip() +
  theme_minimal() +
  geom_text(aes(label = round(total_anime, digit = 2)), hjust = 2, color = "yellow", size = 3.5) +
  ggtitle("Most Popular Producers") +
  xlab("Producers") +
  ylab("Number of Anime") +
  ylim(0,35) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Over here, we analyse the number of producers who produce the most number of animes. So, we see that Yomiuri Telecasting & Yomiko Advertising are way ahead of their competitors when it compares to the number of animes being produced by a single producer/producing company. Over here we display the top 15 producers only.

Regression Analysis

We now try to figure out if a there is a relationship between the people who rated a particular show and the ratings of the show.

On performing a linear regression analysis, we obtain a R-square value of 0.43. Thus, we can conclude that 43 percent of the shows which have high number of raters have a good average rating as well.

The code and the linear regression model can be found below:

reg <- unique_anime %>% 
  filter(rated_by_no_of_viewers > 99 & airing_status == FALSE)

regres <- summary(lm(viewers_rating ~ log(rated_by_no_of_viewers), data = reg))$r.squared
txt <- substitute(R^2 == regres, list(regres = format(regres, digits = 2)))

unique_anime %>% 
  filter(airing_status == FALSE & rated_by_no_of_viewers > 99) %>%
  ggplot(aes(x = rated_by_no_of_viewers, y = viewers_rating)) +
  xlab("No of people who rated") +
  ylab("Ratings of viewers") +
  stat_bin_hex(bins = 50) +
  scale_fill_distiller(palette = "Spectral") +
  stat_smooth(method = "lm", 
              color = "orchid", 
              size = 1.5, 
              formula = y ~ log(x)) +
  annotate("text", 
           label = as.character(as.expression(txt)), 
           parse = TRUE, 
           color = "orchid", 
           x = 750000, 
           y = 2.5, 
           size = 7)

5. Conclusion

As a producer, there are a lot of factors which one should consider before investing money in a particular anime.

Thus, from our analysis results obtained so far, we can figure out the factors which the producers can consider before investing money in a particular anime:

Anime Type: The most popular anime type is TV. Thus, that should be the first choice for the producers.
Broadcasting Days: On broadcasting the show on Mondays, the show has the maximum popularity index. Thus, Monday will be the best bet to broadcast the show to gain maximum popularity.
Premiered Season: The season which has the most mean popularity index is the winter season as a lot of people have holidays during that time. Thus, the best bet for producers will be to release their anime in the winter season as compared to the other seasons of the year.

Hence, we can conclude that the best bet for the producers to obtain a high ROI would be to invest in TV Anime which are premiered in Winter and broadcasted on Monday.

The World of Anime

Sayak Chakraborty | Vipul Mayank