A study on Anime popularity

1. Synopsis

The word anime is the Japanese term for animation, and in Japan it covers all forms of animated media. Outside Japan, anime refers specifically to animation from Japan, or to a Japanese-disseminated animation style often characterized by colorful graphics, vibrant characters and fantastical themes. The anime industry consists of over 430 production studios, including major names like Studio Ghibli, Gainax, and Toei Animation. Despite comprising only a fraction of Japan’s domestic film market, anime makes up a majority of Japanese DVD sales. It has also seen international success after the rise of English-dubbed programming. This rise in international popularity has resulted in non-Japanese productions using the anime art style. Whether these works are anime-influenced animation or proper anime is a subject of debate amongst fans. As of 2016, Japanese anime accounted for 60% of the world’s animated television shows.

Problem Statement:

The objective of this project is to explore the Anime dataset and answer questions such as which factors affect the ratings, rankings and popularity of an Anime.

Approach

I will clean the data before beginning the analysis. As part of the cleaning process, I will handle duplicates, outliers and missing values, drop the variables that look irrelevant, and change data types and values wherever required to make the data tidier. I will then analyze factors such as broadcast day, premiered season, genre and type to see how they relate to the popularity of an Anime. I will also analyze the number of Animes produced by genre, type, season etc., and its trend over the last 100 years. This analysis can help producers get the maximum benefit out of new Animes, be it in monetary terms or in terms of popularity, ratings and fame.

2. Packages

The following packages will be used in the analysis:

tidyverse: a set of packages that work in harmony and can be installed and loaded in a single step.
ggplot2: builds graphics from a data frame, a set of plot aesthetics and layers; aesthetics declared once are common to all subsequent layers unless specifically overridden.
dplyr: provides a flexible grammar of data manipulation. It’s the next iteration of plyr, focused on tools for working with data frames.
DT: renders data objects in R as HTML tables.
tidyr: contains tools for changing the shape (pivoting) and hierarchy (nesting and ‘unnesting’) of a dataset, turning deeply nested lists into rectangular data frames (‘rectangling’), and extracting values out of string columns. It also includes tools for working with missing values (both implicit and explicit).
kableExtra: simplifies the way to manipulate the HTML or ‘LaTeX’ codes generated by ‘kable()’ and allows users to construct complex tables and customize styles using a readable syntax.
lubridate: has a consistent and memorable syntax that makes working with dates easy.
highcharter: provides various types of charts, from scatter plots to heatmaps and treemaps.

library(tidyverse)
library(ggplot2)
library(dplyr)
library(DT)
library(tidyr)
library(kableExtra)
library(lubridate)
library(highcharter)

3. Data Preparation

3.1 Data Source

The dataset used in the study can be found here –> Original dataset.

3.2 Original Dataset

The first step is to import the dataset into R (using RStudio).

tidy_anime <- read.csv("C:/Users/nihar/Dropbox/MSIS-2019 Material/Semster 1 Flex 2 (Fall-2019)/data Wrangling/Final Project/anime_data/tidy_anime.csv", stringsAsFactors = FALSE, header = TRUE)

The original dataset has the following variables:

animeID(int): Anime ID (as in https://myanimelist.net/anime/animeID)
name (character): Describes the title of the Anime.
title_english (character): title in English (sometimes is different, sometimes is missing)
title_japanese (character): title in Japanese (if Anime is Chinese or Korean, the title, if available, in the respective language)
title_synonyms (character): other variants of the title
type (character): anime type (e.g. TV, Movie, OVA)
source (character): source of anime (e.g. original, manga, game, music, visual novel etc.)
producers (character): producers
genre(character): genre
studio(character): studio
episodes(double): number of episodes
status(character): Finished airing or currently airing
airing (logical): True/False respectively if it is still airing or has finished airing
start_date (character): Start date (ymd)
end_date (character): End date (ymd)
duration(character): Per episode duration or entire duration, text string
rating(character): Age rating
score(numeric): (higher = better)
scored_by(int): Number of users that scored
rank(int): Rank - weight according to MyAnimeList formula
popularity(int): based on how many members/users have the respective anime in their list
members(int): number of members that added this anime to their list
favorites(int): number of members that marked this anime as a favorite in their list
synopsis(character): long string with anime synopsis
background (character): long string with production background and other things
premiered (character): anime premiered on season/year
broadcast (character): when it is (regularly) broadcast
related (character): dictionary: related animes, series, games etc.

3.3 Data Cleaning

Before I proceed with data cleaning, I need to understand the dataset, its structure, variable types, missing values, duplicate values etc. I will be doing that in following steps:

Step 1: STUDYING THE BASIC DIMENSIONS OF DATASET:

I will check the dimensions of the dataset using the below code.

Observations:

The dataset has 28 variables and 77911 observations.

dim(tidy_anime)

Step 2: DROPPING IRRELEVANT COLUMNS:

I will look at the different columns associated with the dataset and then drop the ones I feel are not relevant for the study.

names(tidy_anime)

Observations:

There are certain variables like Japanese title, English title, synopsis etc which I don’t think are relevant to my study. I will drop these variables. Later during the course of analysis, I might need to drop other variables too.
The dataset now has 22 columns as compared to 28 in the original dataset

anime_relevant <- select(tidy_anime,-c(3, 4 ,5, 24, 25, 28 ))
dim(anime_relevant) # Verifying the changes
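
Dropping columns by position works only as long as the column order never changes. As a hedged alternative, the same step can be sketched by column name. The data frame toy_anime and its values are invented for illustration, and the six names in drop_cols assume that the columns appear in the order listed in section 3.2.

```r
library(dplyr)

# Hypothetical miniature of tidy_anime; the values are invented
toy_anime <- data.frame(
  animeID       = c(101L, 102L),
  name          = c("Title A", "Title B"),
  title_english = c("Title A (EN)", NA),
  synopsis      = c("Some text", "Some text"),
  score         = c(7.2, 8.1),
  stringsAsFactors = FALSE
)

# Columns considered irrelevant, referenced by name instead of position
drop_cols <- c("title_english", "title_japanese", "title_synonyms",
               "synopsis", "background", "related")

# any_of() skips names that are not present in the data frame
toy_relevant <- select(toy_anime, -any_of(drop_cols))
names(toy_relevant)
```

Referencing names keeps the step valid even if columns are added or reordered upstream.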

Step 3: CHANGING VARIABLE NAMES:

Observation:

There are certain variable names which are not descriptive of the variable, thus I will change their names to more appropriate names using the below code.

names(anime_relevant)[1]<-"anime_ID"
names(anime_relevant)[9]<-"airing_status"
names(anime_relevant)[14]<-"age_group"
names(anime_relevant)[15]<-"anime_rating"
names(anime_relevant)[18]<-"popularity_score"
names(anime_relevant)[19]<-"member_count"
names(anime_relevant)[20]<-"favourite_count"
names(anime_relevant)[21]<-"season_premiered"
names(anime_relevant)[22]<-"broadcast_timing"

colnames(anime_relevant) # Verifying that the names are changed
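
The same renaming can also be written with old and new names side by side, which avoids relying on column positions. This is only a sketch on an invented data frame (toy_relevant); it uses dplyr's rename() with a subset of the names from the code above.

```r
library(dplyr)

# Hypothetical miniature of anime_relevant; the values are invented
toy_relevant <- data.frame(
  animeID = c(101L, 102L),
  status  = c("Finished Airing", "Currently Airing"),
  rating  = c("PG-13", "R"),
  score   = c(7.2, 8.1),
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs
toy_renamed <- rename(
  toy_relevant,
  anime_ID      = animeID,
  airing_status = status,
  age_group     = rating,
  anime_rating  = score
)
names(toy_renamed)
```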

Step 4: CHANGING VARIABLE TYPE:

I will check if all the variable types are appropriate and then change the variable type if required.

str(anime_relevant)

Observations:

Certain variables don’t have the appropriate data type and I will change them to more appropriate type.

anime_relevant$start_date <- as.Date(anime_relevant$start_date)
anime_relevant$end_date <- as.Date(anime_relevant$end_date)
anime_relevant$age_group <- as.factor(anime_relevant$age_group)
anime_relevant$type <- as.factor(anime_relevant$type)
anime_relevant$genre <- as.factor(anime_relevant$genre)
anime_relevant$airing_status <- as.factor(anime_relevant$airing_status)

str(anime_relevant) #Verifying that the data type is updated

Step 5: SEPARATING COLUMNS:

The columns start_date, end_date, season_premiered and broadcast_timing can be split for ease of analysis.

Observations:

After splitting the above mentioned columns, the dataset now has 28 columns.

anime_separated <- anime_relevant %>% 
  separate(start_date, c("start_year", "start_month", "start_day")) %>% 
  separate(end_date, c("end_year", "end_month", "end_day")) %>% 
  separate(season_premiered, c("premiered_season", "premiered_Year")) %>% 
  separate(broadcast_timing, c("broadcast_day", "blank1", "broadcast_time", "blank2"), sep = " ") %>% 
  select(-c(blank1, blank2)) # Dropping the filler pieces of the broadcast string

colnames(anime_separated) # Verifying that the columns are separated
dim(anime_separated) # Verifying the number of columns
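
To make the effect of separate() concrete, here is a small sketch on invented rows. The broadcast string format ("Saturdays at 23:00 (JST)") is an assumption inferred from the four pieces used above, and fill = "right" is added here only so that short strings such as "Unknown" are padded with NA without a warning.

```r
library(tidyr)

# Hypothetical rows; the values and the broadcast string format are assumptions
toy <- data.frame(
  start_date       = c("2019-04-06", "2018-10-01"),
  season_premiered = c("Spring 2019", "Fall 2018"),
  broadcast_timing = c("Saturdays at 23:00 (JST)", "Unknown"),
  stringsAsFactors = FALSE
)

step1 <- separate(toy, start_date, c("start_year", "start_month", "start_day"), sep = "-")
step2 <- separate(step1, season_premiered, c("premiered_season", "premiered_Year"), sep = " ")
step3 <- separate(step2, broadcast_timing,
                  c("broadcast_day", "blank1", "broadcast_time", "blank2"),
                  sep = " ", fill = "right")

# Drop the two filler pieces ("at" and "(JST)")
toy_separated <- step3[, !(names(step3) %in% c("blank1", "blank2"))]
toy_separated
```

Each split column is returned as character, which is why start_month and start_year are converted to integer later on.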

Step 6: HANDLING DUPLICATE OBSERVATIONS:

Next I will check if there are any duplicate rows and remove them accordingly.

Observations:

The resulting dataset has the same number of observations as the original dataset. This means that there are no duplicate observations in the data.

anime_clean <- unique(anime_separated)
dim(anime_clean)
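
Note that unique() only removes rows that are identical in every column. A minimal sketch on invented data shows the difference between fully duplicated rows and repeated IDs, which matters again in Step 10; the data frame toy is hypothetical.

```r
# Hypothetical rows: ID 1 appears with two genres, ID 2 appears twice identically
toy <- data.frame(
  anime_ID = c(1L, 1L, 2L, 2L),
  genre    = c("Action", "Comedy", "Drama", "Drama"),
  stringsAsFactors = FALSE
)

n_exact_duplicates <- sum(duplicated(toy))           # rows identical in every column
n_repeated_ids     <- sum(duplicated(toy$anime_ID))  # rows whose ID was already seen

n_exact_duplicates
n_repeated_ids
```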

Step 7: HANDLING MISSING VALUES:

I will check the number of missing values for each column and then decide how to deal with missing values if any.

Observations:

start_day, anime_relevant, episodes, start_year, genre, start_month have some missing values but the number is quite small compared to the size of the dataset so these can be left as such.
studio, end_year, producers, end_day have a high number of missing values but we cannot delete these observations as it will lead to critical loss of information. These don’t appear much relevant for the study and we might have to delete them later in the study, but for now we will leave them as such.
Premiered year, premiered_season have 36248 missing values, but these values can be imputed using the start date. I will be doing that in the next step.

colSums(is.na(anime_clean))
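
Whether a number of missing values is "small" or "high" is easier to judge as a share of all rows. As a sketch on an invented data frame (toy), colMeans() over the same is.na() matrix gives that share directly.

```r
# Hypothetical columns with missing values; the values are invented
toy <- data.frame(
  episodes = c(12, NA, 24, 1),
  studio   = c(NA, NA, "Studio X", NA),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column, largest first
missing_share <- round(100 * colMeans(is.na(toy)), 1)
sort(missing_share, decreasing = TRUE)
```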

Step 8: IMPUTING MISSING VALUES FOR PREMIERED SEASON AND YEAR COLUMNS:

Before I impute the missing values, I need to change the data type for ‘start_month’ and ‘start_year’ to numeric. After changing the variable type, I will proceed with the imputation of missing values.

Observations:

Imputing the missing values for premiered season and year leaves them with 238 missing values which is the number of missing values in start year and start month columns.

anime_clean$start_month <- as.integer(anime_clean$start_month) #changing data type for start month

anime_clean$start_year <- as.integer(anime_clean$start_year)  #changing data type for start year

anime_clean$premiered_season <- ifelse(anime_clean$start_month %in% c(3, 4, 5), "Spring",
                                ifelse(anime_clean$start_month %in% c(6, 7, 8), "Summer",
                                ifelse(anime_clean$start_month %in% c(9, 10, 11), "Fall",
                                ifelse(anime_clean$start_month %in% c(12, 1, 2), "Winter",
                                       NA)))) # Setting premiered season from start month, which also fills the missing values

anime_clean$premiered_Year <- anime_clean$start_year # Setting premiered year from start year, which also fills the missing values

colSums(is.na(anime_clean))#Verifying the changes

str(anime_clean)
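
The month-to-season mapping above can also be expressed as a lookup vector indexed by month number, which some readers may find easier to audit than nested ifelse() calls. This is a sketch of the same mapping on invented months; season_lookup is a hypothetical name.

```r
# Position i holds the season for month i (same mapping as the nested ifelse above)
season_lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                   "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")

# Invented start months, including a missing one
start_month <- c(4L, 1L, 10L, NA, 12L)

# Indexing with NA returns NA, so missing months stay missing
premiered_season <- season_lookup[start_month]
premiered_season
```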

Step 9: REMOVING ADDITIONAL COLUMNS:

Since start date and premiered season indicate the same meaning, I will drop one of these variables. I will remove start day, month and year and will keep premiered season and year. I will also drop the column ‘airing’ as it is the same as airing_status.

Observations:

The resulting dataset has 24 columns (28 minus the 4 dropped) and 77911 observations.

anime_neat <- select(anime_clean,-c( 10, 11, 12, 13))

colnames(anime_neat)
dim(anime_neat)

Step 10: CHECKING UNIQUE VALUES IN EACH COLUMN AND CHANGING ANY VALUES IF REQUIRED:

Next I will check the unique values in each column to determine if there are any values that need to be changed for more uniformity in the data.

Observations:

The type column has certain values as ‘Unknown’. I will change them to NA as these represent missing data.
The broadcast_time column has both ‘Unknown’ and ‘NA’. I will change the ‘Unknown’ to NA for improved consistency.
The broadcast_day column has ‘Unknown’, ‘Not’ and ‘NA’. I will change both ‘Unknown’ and ‘Not’ to NA for improved consistency.
Not all ID values are unique, so I will create a separate table with only unique IDs so that there are no duplicates. The reason for this duplication is that several titles have multiple genres, and a separate entry is created for each genre. All values except genre are the same for these observations.
The resulting dataset after removing duplicate IDs has 13621 observations.

unique(anime_neat$anime_ID)
unique(anime_neat$name)
unique(anime_neat$type)
unique(anime_neat$genre)
unique(anime_neat$source)
unique(anime_neat$producers)
unique(anime_neat$episodes)
unique(anime_neat$airing_status)
unique(anime_neat$end_year)
unique(anime_neat$end_month)
unique(anime_neat$end_day)

unique(anime_neat$duration)
unique(anime_neat$anime_rating)
unique(anime_neat$premiered_season)
unique(anime_neat$premiered_Year)
unique(anime_neat$broadcast_day)
unique(anime_neat$broadcast_time)

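anime_neat$broadcast_day[anime_neat$broadcast_day =="Unknown" |anime_neat$broadcast_day =="Not"] <- NA
anime_neat$broadcast_time[anime_neat$broadcast_time == "Unknown"] <- NA # Same treatment for broadcast_time, as described above
anime_neat$type[anime_neat$type == "Unknown"] <- NA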

unique_id <- unique(anime_neat$anime_ID)
length(unique_id)

#Filtering data for only unique ID values

anime_final <- anime_neat %>% 
  distinct(anime_ID, .keep_all = TRUE)
dim(anime_final)
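
Before keeping only the first row per ID, it can be worth confirming that the repeated rows really differ only in genre. The following is a sketch on invented data; toy, without_genre and genres_per_title are hypothetical names. It also shows one way to keep all genres of a title in a single string instead of discarding them.

```r
# Hypothetical rows: title 1 is listed once per genre
toy <- data.frame(
  anime_ID     = c(1L, 1L, 2L),
  name         = c("Title A", "Title A", "Title B"),
  genre        = c("Action", "Comedy", "Drama"),
  anime_rating = c(8.1, 8.1, 6.9),
  stringsAsFactors = FALSE
)

# If only genre differs, each ID collapses to one row once genre is removed
without_genre <- toy[, names(toy) != "genre"]
rows_per_id   <- table(unique(without_genre)$anime_ID)
all(rows_per_id == 1)

# Optional: collapse the genres of each title into one comma-separated string
genres_per_title <- aggregate(genre ~ anime_ID, data = toy, FUN = paste, collapse = ", ")
genres_per_title
```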

Step 11: CHECKING FOR OUTLIERS:

I will be using the box plot function to see if there are any outliers

Observation:

The box plots do show some outliers for episodes, anime_rating, scored_by, member_count and favourite_count; however, there is no set criterion to decide whether these are genuine outliers. For example, the number of episodes can differ greatly between types of Anime, and so can anime_rating and the other variables. Thus I will leave them as such.

boxplot(anime_final$rank)

boxplot(anime_final$episodes)

boxplot(anime_final$anime_rating)

boxplot(anime_final$scored_by)

boxplot(anime_final$popularity_score)

boxplot(anime_final$member_count)

boxplot(anime_final$favourite_count)
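
The points a box plot draws as outliers follow the usual 1.5 × IQR rule, and boxplot.stats() returns them directly. The sketch below, on invented episode counts, illustrates the argument made above: whether a value is flagged depends on whether the rule is applied overall or within each type. The data frame toy and the helper flag_outliers are hypothetical.

```r
# Invented episode counts for two anime types
toy <- data.frame(
  type     = c(rep("TV", 6), rep("Movie", 5)),
  episodes = c(12, 12, 13, 24, 26, 500, 1, 1, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Values outside 1.5 * IQR of the hinges, as used by boxplot()
flag_outliers <- function(v) boxplot.stats(v)$out

overall <- flag_outliers(toy$episodes)
by_type <- lapply(split(toy$episodes, toy$type), flag_outliers)

overall
by_type
```

In this toy example a 3-episode movie is flagged within its own type but not overall, so any fixed rule would have to be chosen per group.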

3.4 Cleaned Dataset

The top 100 observations of the final cleaned dataset can be found below in an interactive table.

output_data <- head(anime_final, n=100)

datatable(output_data, filter = 'top', options = list(pageLength = 25))

3.5 Summary of Variables

The final dataset that I will be using in the study has following variables:

anime_ID (Integer): Anime ID (as in https://myanimelist.net/anime/animeID)
name (Character): Describes the title of the Anime.
type (Factor): anime type (e.g. TV, Movie, OVA)
source (Character): source of anime (i.e original, manga, game, music, visual novel etc.)
producers (Character): producers
genre(Factor): genre
studio(Character): studio
episodes(Integer): number of episodes
airing_status(Factor): Finished airing or currently airing
end_day (Character): Day of the end date
end_month (Character): Month of the end date
end_year (Character): Year of the end date
duration (Character): Per episode duration or entire duration, text string
age_group (Factor): Age rating
anime_rating (Numeric): (higher = better)
scored_by (Integer): Number of users that scored
rank (Integer): Rank - weight according to MyAnimeList formula
popularity_score (Integer): based on how many members/users have the respective anime in their list
member_count (Integer): number members that added this anime in their list
favourite_count (Integer): number of members that marked this anime as a favorite in their list
premiered_season (Character): season in which the anime premiered
premiered_Year (Integer): year in which the anime premiered
broadcast_day (Character): day on which it is (regularly) broadcast
broadcast_time (Character): time at which it is (regularly) broadcast

4. Exploratory Data Analysis

The aim of my study is to analyze the popularity of Anime, measured by rating (anime_rating), rank and popularity score (popularity_score). On MyAnimeList, rank and popularity are positions in which 1 is the top position, so a lower value indicates a better-ranked or more popular title; the mean rank and mean popularity values in this section should be read with that in mind. I will not consider scored_by, because the rating is already the mean of all user scores, or member_count, because the popularity score is based on it. I will also leave out favourite_count.

I will be plotting the selected variables against my variables of interest, which are genre, type, premiered season and broadcast day. I will also be analyzing the number of observations/frequencies based on these variables.

4.1 Genre Analysis

I will be analyzing the Rating, Ranking, Frequency and Popularity score by Genre to see how these are dependent on the Genre. In order to analyze the variables of interest by Genre, I will not be using the dataset where I have removed the duplicate IDs (anime_final), instead I will be using the original clean dataset (anime_neat). My reason for doing so is that the duplicate ID’s have a separate entry for each genre, and if I am just keeping one of these genres, my Genre-analysis would be biased.

Step 1: GENRE VS FREQUENCY:

Observation:

The two code blocks below plot the number of Animes for every genre and for the top 10 genres.
Number of Animes produced are highest for Comedy followed by Action and Fantasy.

anime_neat %>% # genre analysis uses anime_neat, which keeps one row per genre
  filter(!(is.na(genre))) %>%
  group_by(genre) %>%
  summarize(num_animes = n_distinct(anime_ID)) %>% # count titles per genre instead of summing ID values
  ggplot(aes(x = reorder(genre, num_animes), y = num_animes)) +
  # a single fill colour: the YlGnBu palette has at most 9 colours, fewer than the number of genres
  geom_bar(stat = "identity", colour = "black", fill = "steelblue", width = 0.8) +
  geom_text(aes(label = num_animes), hjust = 2.0, color = "black", size = 1.5) +
  coord_flip() +
  ggtitle("Number of Animes per Genre") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks = element_blank())

anime_neat %>% 
  filter(!(is.na(genre))) %>% 
  group_by(genre) %>% 
  summarise(num_animes = n_distinct(anime_ID)) %>% # count titles per genre instead of summing ID values
  top_n(10, num_animes) %>%
  ggplot(aes(x = reorder(genre, num_animes), y = num_animes, fill = genre)) +
  geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
  scale_fill_brewer(palette = "YlGnBu") +
  theme_minimal() +
  coord_flip() +
  labs(x = "Genre", y = "Number of Animes") +
  ggtitle("Genre and Number of Animes", subtitle = "Number of Animes vs Genre") + 
  geom_text(aes(label = num_animes), hjust = 2.0, color = "black", size = 3.5) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "black", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())
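
A quick cross-check for a "number of Animes per genre" figure is to count distinct IDs per genre on a small table. The sketch below uses invented rows (toy) and shows that counting titles and summing ID values give very different numbers, since a sum depends on how large the IDs happen to be.

```r
# Invented rows: title 10 is listed twice under Comedy
toy <- data.frame(
  anime_ID = c(10L, 10L, 10L, 2000L, 30L),
  genre    = c("Comedy", "Comedy", "Action", "Action", "Comedy"),
  stringsAsFactors = FALSE
)

# Number of distinct titles per genre
titles_per_genre <- tapply(toy$anime_ID, toy$genre, function(ids) length(unique(ids)))

# Sum of the ID values per genre, which is not a count
id_sum_per_genre <- tapply(toy$anime_ID, toy$genre, sum)

titles_per_genre
id_sum_per_genre
```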

Step 2: GENRE VS AVERAGE RATING:

Observation:

The below code outputs the top 10 Genres based on highest average ratings.
Thrillers are rated the highest followed by Josei and Mystery.

anime_neat %>% 
  filter(!(is.na(genre))) %>% 
  group_by(genre) %>% 
  summarise(mean_user_rating = mean(anime_rating, na.rm = TRUE)) %>% 
  top_n(10, mean_user_rating) %>%
  ggplot(aes(x = reorder(genre, mean_user_rating), y = mean_user_rating, fill = genre)) +
  geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
  scale_fill_brewer(palette = "YlGnBu") +
  theme_minimal() +
  coord_flip() +
  labs(x="Genre", y="Average Rating") +
  geom_text(aes(label=round(mean_user_rating,digit =2)), hjust=2.0, color = "black", size= 3.5) +
  ggtitle("Genre and Average Ratings", subtitle = "Viewer's Rating vs Genre") + 
  xlab("Genre") + 
  ylab("Average User Ratings") +
  ylim(0,10) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "black", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Step 3: GENRE VS RANKING:

Observation:

The below code outputs the 10 genres with the highest mean rank value.
‘Kids’ has the highest mean rank value, followed by ‘Dementia’ and ‘Music’. These are not the three highest-rated genres (Thriller, Josei and Mystery), so the ordering by mean rank value does not follow the ordering by rating.

anime_neat %>%
  filter(!(is.na(genre))) %>% 
  group_by(genre) %>%
  summarize(rankbygenre = mean(rank, na.rm=TRUE)) %>%
   top_n(10, rankbygenre) %>%
  ggplot(aes(x = reorder(genre, rankbygenre), y = rankbygenre, fill=genre)) + 
  geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
  scale_fill_brewer(palette = "YlGnBu") +
  geom_text(aes(label=round(rankbygenre, digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
  theme_minimal() +
  coord_flip() +
  labs(x="Genre", y="Mean ranking") +
  ggtitle("Rankings By genre") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks = element_blank())

Step 4: GENRE VS POPULARITY SCORE:

Observation:

The below code outputs the 10 genres with the highest mean popularity score.
‘Kids’ has the highest mean popularity score, followed by ‘Dementia’ and ‘Music’. These are not the highest-rated genres, so the ordering by popularity score does not follow the ordering by rating.
The top 3 genres by mean rank value and by mean popularity score are the same, so there appears to be a relation between ranking and popularity.

anime_neat %>%
  filter(!(is.na(genre))) %>% 
  group_by(genre) %>%
  summarize(popularity_genre = mean(popularity_score, na.rm=TRUE)) %>%
  top_n(10, popularity_genre) %>%
  ggplot(aes(x = reorder(genre, +popularity_genre), y = popularity_genre, fill=genre)) + 
  geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
  scale_fill_brewer(palette = "YlGnBu") +
  geom_text(aes(label=round(popularity_genre,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
  theme_minimal() +
  coord_flip() +
  labs(x="Genre", y="Popularity Score") +
  ggtitle("Popularity By genre") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks = element_blank())
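
Comparing top-3 lists only hints at how rating, rank and popularity relate. One way to quantify it at the level of individual titles is a Spearman rank correlation, sketched here on five invented titles; the values in toy are made up and do not come from the dataset.

```r
# Invented titles: ratings fall as the rank value grows
toy <- data.frame(
  anime_rating     = c(8.9, 8.1, 7.4, 6.8, 5.9),
  rank             = c(15, 420, 1900, 4100, 8700),
  popularity_score = c(3, 250, 800, 5200, 3100)
)

# Spearman correlation only uses the ordering of the values
rho_rating_rank     <- cor(toy$anime_rating, toy$rank, method = "spearman")
rho_rank_popularity <- cor(toy$rank, toy$popularity_score, method = "spearman")

rho_rating_rank
rho_rank_popularity
```

On the real data, the same two calls with use = "complete.obs" would skip titles with missing values.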

4.2 Type Analysis

Step 1: TYPE VS COUNT:

Observation:

The below code outputs the number of Animes based on type.
Top 3 types are TV, followed by Movie and Special.

anime_final %>% 
  filter(!(is.na(type))) %>% 
  filter(!(type == "Unknown")) %>% 
  group_by(type) %>% 
  summarise(number_animes = n()) %>% # count titles per type instead of summing ID values
  ggplot(aes(x = reorder(type, number_animes), y = number_animes, fill = type)) +
  geom_bar(stat = "identity", colour = "black", width = 0.8, position = position_dodge(width = 0.9)) +
  scale_fill_brewer(palette = "YlGnBu") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Anime Type", y = "Count") +
  geom_text(aes(label = number_animes), hjust = 2.0, color = "darkblue", size = 3.5) +
  ggtitle("Count by Anime Type", subtitle = "Count vs Anime Type") + 
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Step 2: TYPE VS RATING:

Observation:

The below code outputs the rating of Animes based on Type.
Top 3 types are TV, followed by Special and OVA.

anime_final %>% 
  filter(!(is.na(type))) %>% 
  filter(!(type == "Unknown")) %>% 
  group_by(type) %>% 
  summarise(mean_user_rating = mean(anime_rating, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(type, mean_user_rating), y = mean_user_rating, fill = type)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "YlGnBu") +
  theme_minimal() +
  labs(x="Anime Type", y="Average Rating") +
  geom_text(aes(label=round(mean_user_rating,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
  ggtitle("Ratings by Anime Type", subtitle = "Average Rating vs Anime Type") + 
  ylim(0,10) +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Step 3: TYPE VS RANKING:

Observation:

The below code outputs the ranking of Animes based on Type.
Music has the highest mean rank value, followed by ONA and OVA.

anime_final %>% 
  filter(!(is.na(type))) %>% 
  filter(!(type == "Unknown")) %>% 
  group_by(type) %>% 
  summarise(rank_anime = mean(rank, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(type, rank_anime), y =rank_anime, fill = type)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "YlGnBu") +
  theme_minimal() +
  labs(x="Anime Type", y="Rank") +
  geom_text(aes(label=round(rank_anime,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
  ggtitle("Rank by Anime Type", subtitle = "Rank vs Anime Type") + 
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Step 4: TYPE VS POPULARITY:

Observation:

The below code outputs the popularity of Animes based on Type.
Music has the highest mean popularity score, followed by ONA and Movie.

anime_final %>% 
  filter(!(is.na(type))) %>% 
  filter(!(type == "Unknown")) %>% 
  group_by(type) %>% 
  summarise(popularity_anime = mean(popularity_score, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(type, popularity_anime), y =popularity_anime, fill = type)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "YlGnBu") +
  theme_minimal() +
  labs(x="Anime Type", y="Popularity Score") +
  geom_text(aes(label=round(popularity_anime,digit =2)), hjust=2.0, color = "darkblue", size= 3.5) +
  ggtitle("Popularity Score by Anime Type", subtitle = "Popularity Score vs Anime Type") + 
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

4.3 Season Premiered

Step 1: SEASON VS COUNT:

Observation:

The below code outputs the number of Animes based on the season they are premiered in.
Highest number of animes are premiered during the Winter, and there isn’t much difference between the other 3 seasons, with fall being the lowest.

Step 2: SEASON VS AVERAGE RATING:

Observation:

The below code outputs the Average Rating of Animes based on the season they are premiered in.
There isn’t any considerable difference in the ratings based on the seasons with highest being ‘Fall’.

Step 3: SEASON VS RANKING:

Observation:

The below code outputs the ranking of Animes based on the season they are premiered in.
Animes premiered in winter have the highest mean rank value, and there isn’t any considerable difference between the other 3 seasons, with fall being the lowest. Fall also has the highest average rating, so the ordering by mean rank value is not the same as the ordering by rating.

Step 4: SEASON VS POPULARITY:

Observation:

The below code outputs the popularity of Animes based on the season they are premiered.
The order of popularity is the same as the order of Ranking.
There could be a relation between Ranking and popularity based on seasons.
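
The four season summaries described above can be computed in one pass and then plotted column by column. The following is a sketch rather than the code used for the reported results: toy_final is an invented stand-in for anime_final, and the column names follow section 3.5.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for anime_final; all values are invented
toy_final <- data.frame(
  premiered_season = c("Winter", "Winter", "Spring", "Summer", "Fall", "Fall", NA),
  anime_rating     = c(7.0, 6.0, 7.5, 6.5, 8.0, 7.0, 5.0),
  rank             = c(900, 3000, 600, 2500, 200, 1000, 7000),
  popularity_score = c(400, 2000, 700, 1800, 150, 900, 6000),
  stringsAsFactors = FALSE
)

# Count, mean rating, mean rank and mean popularity per season
season_summary <- toy_final %>%
  filter(!is.na(premiered_season)) %>%
  group_by(premiered_season) %>%
  summarise(
    num_animes      = n(),
    mean_rating     = mean(anime_rating, na.rm = TRUE),
    mean_rank       = mean(rank, na.rm = TRUE),
    mean_popularity = mean(popularity_score, na.rm = TRUE)
  )

# One bar chart per summary column; shown here for the count
season_count_plot <- ggplot(season_summary,
                            aes(x = reorder(premiered_season, num_animes), y = num_animes)) +
  geom_col() +
  coord_flip() +
  labs(x = "Season premiered", y = "Number of Animes")

season_summary
```

Replacing num_animes in the plot with mean_rating, mean_rank or mean_popularity gives the other three charts.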

4.4 Broadcast day

Step 1: BROADCAST DAY VS COUNT:

Observation:

The below code outputs the number of Anime based on the day on which they are broadcasted.
Highest number of anime are broadcasted over the weekend and least on Thursdays.

Step 2: BROADCAST DAY VS AVERAGE RATING:

Observation:

The below code outputs the average rating of Anime based on the day on which they are broadcasted.
Average rating of Anime broadcasted over the weekend is slightly higher than rest of the days but the difference isn’t that significant.
This could mean that the day of broadcast does not affect the rating of an Anime significantly.

Step 3: BROADCAST DAY VS POPULARITY:

Observation:

The below code outputs the Popularity of Anime based on the day on which they are broadcasted.
The highest mean popularity score is observed for Anime broadcasted on Mondays, followed by Wednesday and Saturday. This ordering differs from the ordering by rating.

Step 4: BROADCAST DAY VS RANKING:

Observation:

The below code outputs the Ranking of Anime based on the day on which they are broadcasted.
The highest mean rank value is observed for Anime broadcasted on Wednesdays, followed by Monday and Thursday. This ordering also differs from the ordering by rating.
Animes broadcasted on Mondays and Wednesdays show the highest mean popularity score and mean rank value.
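
The broadcast-day summaries above can be sketched in the same way, with the days stored as an ordered factor so that tables and plots follow the week rather than the alphabet. The data frame toy_final is invented, and the plural day labels ("Mondays", "Saturdays") are an assumption about how broadcast_day is written in the data.

```r
library(dplyr)

# Hypothetical stand-in for anime_final; all values are invented
toy_final <- data.frame(
  broadcast_day    = c("Saturdays", "Mondays", "Saturdays", "Wednesdays", NA),
  anime_rating     = c(7.5, 6.5, 6.5, 7.0, 6.0),
  rank             = c(800, 2400, 1600, 1200, 5000),
  popularity_score = c(300, 1500, 900, 700, 4000),
  stringsAsFactors = FALSE
)

# Assumed spelling of the day labels, in calendar order
day_levels <- c("Mondays", "Tuesdays", "Wednesdays", "Thursdays",
                "Fridays", "Saturdays", "Sundays")

day_summary <- toy_final %>%
  filter(!is.na(broadcast_day)) %>%
  mutate(broadcast_day = factor(broadcast_day, levels = day_levels)) %>%
  group_by(broadcast_day) %>%
  summarise(
    num_animes      = n(),
    mean_rating     = mean(anime_rating, na.rm = TRUE),
    mean_rank       = mean(rank, na.rm = TRUE),
    mean_popularity = mean(popularity_score, na.rm = TRUE)
  )

day_summary
```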

4.5 Anime Trend over the years

To analyze the trend of anime over the years, I have dropped the observations for the year 2019. I do not have the complete data for this year and this can lead to biased results.

Observation:

The below code outputs the trend Anime production has followed over the years.
The production of Anime has been following an upwards trend over the years with a few dips here and there. But overall, the production of Anime has been rising over the years.
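
The yearly trend can be sketched by counting titles per premiere year and dropping the incomplete year 2019, as described above. This is an illustration on invented years, not the code behind the reported trend; toy_final stands in for anime_final.

```r
library(dplyr)
library(ggplot2)

# Hypothetical premiere years; the values are invented
toy_final <- data.frame(
  premiered_Year = c(2016L, 2017L, 2017L, 2018L, 2018L, 2018L, 2019L, NA)
)

# Titles per year, excluding missing years and the incomplete year 2019
yearly <- toy_final %>%
  filter(!is.na(premiered_Year), premiered_Year < 2019) %>%
  count(premiered_Year, name = "num_animes")

trend_plot <- ggplot(yearly, aes(x = premiered_Year, y = num_animes)) +
  geom_line() +
  geom_point() +
  labs(x = "Year premiered", y = "Number of Animes",
       title = "Number of Animes premiered per year")

yearly
```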

5. Summary

Summary of the problem: To analyze the Japanese Anime dataset and find interesting insights into the factors that affect the ratings, ranking and popularity of an Anime over the last 100 years.

Data Used: The data was collected from an online data source available at this –> Location.

This data comes from Tam Nguyen and MyAnimeList.net via Kaggle. According to Wikipedia - “MyAnimeList, often abbreviated as MAL, is an anime and manga social networking and social cataloging application website. The site provides its users with a list-like system to organize and score anime and manga. It facilitates finding users who share similar tastes and provides a large database on anime and manga. The site claims to have 4.4 million anime and 775,000 manga entries. In 2015, the site received 120 million visitors a month.”

Methodology: Various factors like broadcast day, premiered season, genre, type etc. were analyzed to see how they relate to the popularity of an Anime. Variables that looked irrelevant were dropped as part of the cleaning and during the course of the analysis. I also analyzed the number of Anime produced based on different factors like genre, type, season etc. This analysis can help producers get the maximum benefit out of new anime, be it in monetary terms or in terms of popularity, ratings and fame.

Interesting insights from Analysis: The above analysis helped me understand the anime dataset better. On MyAnimeList, rank and popularity are positions in which 1 is the top position, so a high mean value indicates a lower standing. The following insights were drawn:

The number of Anime produced is highest for Comedy, followed by Action and Fantasy.
Thrillers are rated the highest, followed by Josei and Mystery.
‘Kids’ has the highest mean rank value, followed by ‘Dementia’ and ‘Music’. These are not the highest-rated genres, so the ordering by mean rank value does not follow the ordering by rating.
‘Kids’ also has the highest mean popularity score, followed by ‘Dementia’ and ‘Music’, so the ordering by popularity score does not follow the ordering by rating either.
The top 3 genres by mean rank value and by mean popularity score are the same.
The most common Anime type is TV, followed by Movie and Special.
Music has the highest mean popularity score among types, followed by ONA and Movie.
The highest number of Anime premiere in winter, and there isn’t much difference between the other 3 seasons.
There isn’t any considerable difference in the ratings across seasons, with ‘Fall’ being the highest.
Anime premiered in winter have the highest mean rank value, and there isn’t any considerable difference between the other 3 seasons, with fall being the lowest. The ordering by mean rank value therefore differs from the ordering by rating.
The order of popularity is the same as the order of ranking. There could be a relation between ranking and popularity based on seasons.
The highest number of Anime are broadcast over the weekend and the fewest on Thursdays.
The average rating of Anime broadcast over the weekend is slightly higher than on the other days, but the difference is small.
This could mean that the day of broadcast does not affect the rating of an Anime.
The highest mean popularity score is observed for Anime broadcast on Mondays, followed by Saturday and Wednesday. This ordering differs from the ordering by rating.
The highest mean rank value is observed for Anime broadcast on Wednesdays, followed by Monday and Thursday. This ordering also differs from the ordering by rating.
Anime broadcast on Mondays and Wednesdays show the highest mean popularity score and mean rank value.
The production of Anime has followed an upward trend over the years, with a few dips here and there.

Implications to the consumer from analysis: This information can be helpful for several groups, such as film producers, critics and viewers. By looking at ranking and popularity by broadcast day, producers can decide which days are best for broadcasting their work. Similarly, by looking at the premiered season, producers can decide when to premiere, and critics can see at which time of the year the largest or smallest number of Anime is released. Viewers and producers can also see which categories of Anime are most popular among viewers.

Limitations in Analysis and future scope of Improvement: This analysis is descriptive. A regression analysis on this dataset could be used to derive a formula for estimating viewership, popularity and rankings.

A study on Anime popularity

Niharika Gupta

November 9, 2019
