1. Synopsis

The word anime is the Japanese term for animation and, in Japan, covers all forms of animated media. Outside Japan, anime refers specifically to animation from Japan, or to a Japanese animation style often characterized by colorful graphics, vibrant characters and fantastical themes. The anime industry consists of over 430 production studios, including major names like Studio Ghibli, Gainax, and Toei Animation. Despite comprising only a fraction of Japan’s domestic film market, anime makes up a majority of Japanese DVD sales. It has also seen international success following the rise of English-dubbed programming. This rise in international popularity has resulted in non-Japanese productions using the anime art style. Whether these works are anime-influenced animation or proper anime is a subject of debate amongst fans. As of 2016, Japanese anime accounted for 60% of the world’s animated television shows.

Problem Statement:

The objective of this project is to explore the Anime dataset to answer interesting questions like what factors affect the ratings and popularity of an Anime.

Approach

Various factors like age category, broadcast time, premiered season, genre, type etc. will be analyzed to predict the popularity of an Anime. As a part of cleaning, I will drop the variables that look irrelevant. During the course of analysis, I will also drop the variables that show no correlation with the popularity, ratings or rankings of the anime. This approach will let me compare the relevant variables and derive conclusions about which factors decide the popularity of an Anime. This can help producers get the maximum benefit out of new animes, be it in monetary terms or in terms of popularity, ratings and fame.

2. Packages

The following packages will be used in the analysis:

  1. tidyverse: a set of packages that work in harmony; loading it makes it easy to install and load multiple ‘tidyverse’ packages in a single step.
  2. ggplot2: a system for creating graphics declaratively. Its ggplot() function initializes a plot object, declares the input data frame for a graphic and specifies the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
  3. dplyr: dplyr provides a flexible grammar of data manipulation. It’s the next iteration of plyr, focused on tools for working with data frames.
  4. DT: Data objects in R can be rendered as HTML tables using this package.
  5. tidyr: ‘tidyr’ contains tools for changing the shape (pivoting) and hierarchy (nesting and ‘unnesting’) of a dataset, turning deeply nested lists into rectangular data frames (‘rectangling’), and extracting values out of string columns. It also includes tools for working with missing values (both implicit and explicit).
  6. kableExtra: simplifies the way to manipulate the HTML or ‘LaTeX’ codes generated by ‘kable()’ and allows users to construct complex tables and customize styles using a readable syntax.
  7. lubridate: has a consistent and memorable syntax that makes working with dates easy and fun.
library(tidyverse)
library(ggplot2)
library(dplyr)
library(DT)
library(tidyr)
library(kableExtra)
library(lubridate)

3. Data Preparation

3.1 Data Source

The dataset used in the study can be found here –> Original dataset.

3.2 Original Dataset

The first step is to import the dataset into RStudio.

tidy_anime <- read.csv("tidy_anime.csv", stringsAsFactors = FALSE, header = TRUE)

The original dataset has the following variables:

  1. animeID (int): Anime ID (as in https://myanimelist.net/anime/animeID)
  2. name (character): title of the anime
  3. title_english (character): title in English (sometimes different, sometimes missing)
  4. title_japanese (character): title in Japanese (if the anime is Chinese or Korean, the title, if available, in the respective language)
  5. title_synonyms (character): other variants of the title
  6. type (character): anime type (e.g. TV, Movie, OVA)
  7. source (character): source of the anime (e.g. original, manga, game, music, visual novel etc.)
  8. producers (character): producers
  9. genre (character): genre
  10. studio (character): studio
  11. episodes (double): number of episodes
  12. status (character): finished airing or currently airing
  13. airing (logical): TRUE if it is still airing, FALSE if it has finished airing
  14. start_date (character): start date (ymd)
  15. end_date (character): end date (ymd)
  16. duration (character): per-episode duration or entire duration, text string
  17. rating (character): age rating
  18. score (numeric): user score (higher = better)
  19. scored_by (int): number of users that scored the anime
  20. rank (int): rank, weighted according to the MyAnimeList formula
  21. popularity (int): based on how many members/users have the respective anime in their list
  22. members (int): number of members that added this anime to their list
  23. favorites (int): number of members that marked this anime as a favorite in their list
  24. synopsis (character): long string with the anime synopsis
  25. background (character): long string with production background and other details
  26. premiered (character): season/year in which the anime premiered
  27. broadcast (character): when it is (regularly) broadcast
  28. related (character): dictionary of related animes, series, games etc.

3.3 Data Cleaning

Before I proceed with data cleaning, I need to understand the dataset: its structure, variable types, missing values, duplicate values etc. I will do that in the following steps:

Step 1: STUDYING THE BASIC DIMENSIONS OF DATASET:

I will check the dimensions of the dataset using the below code.

Observations:

  • The dataset has 28 variables and 77911 observations.
dim(tidy_anime)

Step 2: DROPPING IRRELEVANT COLUMNS:

I will look at the different columns associated with the dataset and then drop the ones I feel are not relevant for the study.

names(tidy_anime)

Observations:

  • Certain variables, such as the Japanese title, English title and synopsis, are not relevant to my study, so I will drop them. Later during the course of analysis, I might need to drop other variables too.
  • The dataset now has 22 columns as compared to 28 in the original dataset
anime_relevant <- select(tidy_anime,-c(3, 4 ,5, 24, 25, 28 ))
dim(anime_relevant) # Verifying the changes
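
Dropping columns by position works only as long as the column order never changes. As a hedged alternative, the sketch below drops the same six columns by name. The data frame toy_anime and its values are made up for illustration; only the column names come from the variable list in section 3.2.

```r
library(dplyr)

# Toy stand-in for tidy_anime (made-up values, only a few of the 28 columns)
toy_anime <- data.frame(
  animeID        = c(1L, 5L),
  name           = c("Show A", "Show B"),
  title_english  = c("Show A", NA),
  title_japanese = c("A", "B"),
  synopsis       = c("text", "text"),
  score          = c(8.8, 8.4),
  stringsAsFactors = FALSE
)

# Names of the columns dropped by position in the step above
cols_to_drop <- c("title_english", "title_japanese", "title_synonyms",
                  "synopsis", "background", "related")

# any_of() silently skips names that are not present in the data
toy_relevant <- select(toy_anime, -any_of(cols_to_drop))

names(toy_relevant)
```

Selecting by name keeps the step valid even if columns are added or reordered upstream.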

Step 3: CHANGING VARIABLE NAMES:

Observation:

  • There are certain variable names which are not descriptive of the variable, thus I will change their names to more appropriate names using the below code.
names(anime_relevant)[1]<-"anime_ID"
names(anime_relevant)[9]<-"airing_status"
names(anime_relevant)[14]<-"age_group"
names(anime_relevant)[15]<-"anime_rating"
names(anime_relevant)[18]<-"popularity_score"
names(anime_relevant)[19]<-"member_count"
names(anime_relevant)[20]<-"favourite_count"
names(anime_relevant)[21]<-"season_premiered"
names(anime_relevant)[22]<-"broadcast_timing"

   
colnames(anime_relevant) # Verifying that the names are changed
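
Renaming by index has the same fragility as dropping by index. A minimal sketch of the same idea with dplyr's rename(), which maps new names to old names explicitly, is shown below. The toy data frame is an assumption for illustration and contains only four of the renamed columns.

```r
library(dplyr)

# Toy stand-in with a few of the original column names (made-up values)
toy_relevant <- data.frame(
  animeID = 1L,
  status  = "Finished Airing",
  rating  = "PG-13",
  score   = 8.8,
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs
toy_renamed <- rename(
  toy_relevant,
  anime_ID      = animeID,
  airing_status = status,
  age_group     = rating,
  anime_rating  = score
)

names(toy_renamed)
```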

Step 4: CHANGING VARIABLE TYPE:

I will check if all the variable types are appropriate and then change the variable type if required.

str(anime_relevant)

Observations:

  • Certain variables don’t have the appropriate data type, so I will change them to a more appropriate type.
anime_relevant$start_date <- as.Date(anime_relevant$start_date)
anime_relevant$end_date <- as.Date(anime_relevant$end_date)
anime_relevant$age_group <- as.factor(anime_relevant$age_group)
anime_relevant$type <- as.factor(anime_relevant$type)
anime_relevant$genre <- as.factor(anime_relevant$genre)
anime_relevant$airing_status <- as.factor(anime_relevant$airing_status)

str(anime_relevant) #Verifying that the data type is updated
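
When converting character columns to dates, it can help to state the expected format and count how many values fail to parse. The sketch below assumes dates are stored as "YYYY-MM-DD" strings, as the (ymd) note in the variable list suggests; the vector toy_dates is made up.

```r
# Made-up date strings, including an empty string and a missing value
toy_dates <- c("1998-04-03", "2001-09-01", "", NA)

# Treat empty strings as missing before parsing
toy_dates[toy_dates %in% ""] <- NA

# An explicit format returns NA for unparseable values instead of guessing
parsed <- as.Date(toy_dates, format = "%Y-%m-%d")

# Number of values that are missing after conversion
n_unparsed <- sum(is.na(parsed))
n_unparsed
```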

Step 5: SEPARATING COLUMNS:

The columns start_date, end_date, season_premiered and broadcast_timing can be split for ease of analysis.

Observations:

  • After splitting the above mentioned columns, the dataset now has 28 columns.
anime_separated <- anime_relevant %>% 
  separate(start_date, c("start_year", "start_month", "start_day"))    %>% 
  separate(end_date, c("end_year", "end_month", "end_day"))   %>%  
  separate(season_premiered, c("premiered_season", "premiered_Year")) %>%  
  separate(broadcast_timing, c("broadcast_day", "blank1", "broadcast_time", "blank2"), sep = " " ) %>% 
      select(-c(blank1, blank2))
  
 
colnames(anime_separated) #Verifying that the columns are separated
dim(anime_separated) #Verifying number of columns
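
To see what the splitting does to individual values, here is a small sketch on made-up rows. It assumes that the premiered value looks like "Spring 1998" and the broadcast value looks like "Saturdays at 01:00 (JST)"; that broadcast layout is inferred from the four-way split used above and is not confirmed by the source.

```r
library(dplyr)
library(tidyr)

# Made-up rows in the assumed layout of the two columns
toy <- data.frame(
  season_premiered = c("Spring 1998", "Fall 2001"),
  broadcast_timing = c("Saturdays at 01:00 (JST)", "Sundays at 17:00 (JST)"),
  stringsAsFactors = FALSE
)

toy_separated <- toy %>%
  # Default separator: any run of non-alphanumeric characters
  separate(season_premiered, c("premiered_season", "premiered_Year")) %>%
  # Split on single spaces; "at" and "(JST)" land in throwaway columns
  separate(broadcast_timing,
           c("broadcast_day", "blank1", "broadcast_time", "blank2"),
           sep = " ") %>%
  select(-c(blank1, blank2))

toy_separated
```

Note that separate() returns character columns unless convert = TRUE is set, which is why the year and month need a type conversion later.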

Step 6: HANDLING DUPLICATE OBSERVATIONS:

Next I will check if there are any duplicate rows and remove them accordingly.

Observations:

  • The resulting dataset has the same number of observations as the original dataset. This means that there are no duplicate observations in the data.
anime_clean <- unique(anime_separated)
dim(anime_clean) 
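
Comparing dimensions before and after unique() shows whether duplicates exist. A more direct check, sketched below on a made-up data frame, counts duplicated rows with duplicated().

```r
# Made-up data with one fully duplicated row
toy <- data.frame(
  anime_ID = c(1L, 1L, 5L),
  genre    = c("Action", "Action", "Drama"),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row
n_dupes <- sum(duplicated(toy))

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]

n_dupes
nrow(toy_unique)
```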

Step 7: HANDLING MISSING VALUES:

I will check the number of missing values for each column and then decide how to deal with missing values if any.

Observations:

  • start_day, anime_relevant, episodes, start_year, genre, start_month have some missing values but the number is quite small compared to the size of the dataset so these can be left as such.
  • studio, end_year, producers, end_day have a high number of missing values but we cannot delete these observations as it will lead to critical loss of information. These don’t appear much relevant for the study and we might have to delete them later in the study, but for now we will leave them as such.
  • premiered_Year and premiered_season have 36248 missing values, but these values can be imputed using the start date. I will be doing that in the next step.
colSums(is.na(anime_clean))
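
Raw counts of missing values are easier to judge as a share of all rows. The sketch below, on a made-up data frame, computes the proportion of missing values per column and sorts it in decreasing order.

```r
# Made-up data with different amounts of missingness per column
toy <- data.frame(
  studio   = c(NA, "A", NA, "B"),
  episodes = c(12, NA, 24, 1),
  name     = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# colMeans() of a logical matrix gives the share of TRUE values per column
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)

# Express as percentages for reading
round(100 * missing_share, 1)
```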

Step 8: IMPUTING MISSING VALUES FOR PREMIERED SEASON AND YEAR COLUMNS:

Before I impute the missing values, I need to change the data type for ‘start_month’ and ‘start_year’ to numeric. After changing the variable type, I will proceed with the imputation of missing values.

Observations:

  • Imputing the missing values for premiered season and year leaves them with 238 missing values which is the number of missing values in start year and start month columns.
anime_clean$start_month <- as.integer(anime_clean$start_month) #changing data type for start month

anime_clean$start_year <- as.integer(anime_clean$start_year)  #changing data type for start year

anime_clean$premiered_season <- ifelse(anime_clean$start_month %in% c(3, 4, 5), "Spring",
                                ifelse(anime_clean$start_month %in% c(6, 7, 8), "Summer",
                                ifelse(anime_clean$start_month %in% c(9, 10, 11), "Fall",
                                ifelse(anime_clean$start_month %in% c(12, 1, 2), "Winter",
                                       NA)))) # Deriving premiered season from start month

anime_clean$premiered_Year <- anime_clean$start_year # Deriving premiered year from start year

colSums(is.na(anime_clean)) #Verifying the changes
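
The assignments above replace every value of the two columns, including values that were already present. If the intent is to fill only the gaps, one possible variant uses dplyr's case_when() and coalesce(), sketched here on made-up data. The month-to-season mapping is copied from the code above, and the sketch assumes premiered_Year has already been converted to integer so that it is type-compatible with start_year.

```r
library(dplyr)

# Made-up rows: the first already has a season and year, the last has no start date
toy <- data.frame(
  start_month      = c(4L, 10L, 1L, NA),
  start_year       = c(1998L, 2001L, 2010L, NA),
  premiered_season = c("Spring", NA, NA, NA),
  premiered_Year   = c(1998L, NA, NA, NA),
  stringsAsFactors = FALSE
)

toy_imputed <- toy %>%
  mutate(
    # Same month-to-season mapping as in the step above
    season_from_month = case_when(
      start_month %in% c(3, 4, 5)   ~ "Spring",
      start_month %in% c(6, 7, 8)   ~ "Summer",
      start_month %in% c(9, 10, 11) ~ "Fall",
      start_month %in% c(12, 1, 2)  ~ "Winter",
      TRUE                          ~ NA_character_
    ),
    # coalesce() keeps the existing value and falls back only where it is NA
    premiered_season = coalesce(premiered_season, season_from_month),
    premiered_Year   = coalesce(premiered_Year, start_year)
  ) %>%
  select(-season_from_month)

toy_imputed
```

Rows without a start date stay missing in both columns, which matches the observation that the remaining gaps equal the gaps in start year and start month.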

Step 9: REMOVING ADDITIONAL COLUMNS:

Since start date and premiered season indicate the same meaning, I will drop one of these variables. I will remove start day, month and year and will keep premiered season and year. I will also drop the column ‘airing’ as it is the same as airing_status.

Observations:

  • The resulting dataset has 24 columns (28 minus the 4 dropped) and 77911 observations.
anime_final <- select(anime_clean,-c( 10, 11, 12, 13))

colnames(anime_final)

dim(anime_final) # Verifying the dimensions

Step 10: CHECKING UNIQUE VALUES IN EACH COLUMN AND CHANGING ANY VALUES IF REQUIRED:

Next I will check the unique values in each column to determine if there are any values that need to be changed for more uniformity in the data.

Observations:

  • The type column has certain values as ‘Unknown’. I will change them to NA as these represent missing data.
  • The broadcast_time column has both ‘Unknown’ and NA. I will change the ‘Unknown’ values to NA for improved consistency.
unique(anime_final$anime_ID)
unique(anime_final$name)
unique(anime_final$type)
unique(anime_final$genre)
unique(anime_final$source)
unique(anime_final$producers)
unique(anime_final$episodes)
unique(anime_final$airing_status)
unique(anime_final$end_year)
unique(anime_final$end_month)
unique(anime_final$end_day)
unique(anime_final$duration)
unique(anime_final$anime_rating)
unique(anime_final$premiered_season)
unique(anime_final$premiered_Year)
unique(anime_final$broadcast_day)
unique(anime_final$broadcast_time)


anime_final$broadcast_time[anime_final$broadcast_time == "Unknown"] <- NA
anime_final$type[anime_final$type == "Unknown"] <- NA
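
Because type was converted to a factor earlier, setting values to NA removes them from the data but leaves "Unknown" behind as an unused level. The sketch below, on a made-up factor, shows that behaviour and how droplevels() clears it.

```r
# Made-up factor with an "Unknown" entry and an existing NA
type <- factor(c("TV", "Unknown", "Movie", NA))

# %in% returns FALSE (not NA) for missing values, so the index is safe
type[type %in% "Unknown"] <- NA

# The level is still registered even though no value uses it
levels_before <- levels(type)

# droplevels() removes levels that no longer occur
type <- droplevels(type)
levels_after <- levels(type)

levels_before
levels_after
```

Dropping the unused level keeps it from showing up as an empty category in later tables and plots.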

Step 11: CHECKING FOR OUTLIERS:

Observation:

  • There are no values in the dataset which can be considered as outliers. Our dataset is now clean for the next steps.
summary(anime_final)
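
Reading summary() output is a judgment call. One common, more explicit screen is the 1.5 × IQR rule, sketched below on a made-up vector of episode counts. This is an illustrative technique, not the check used to reach the observation above.

```r
# Made-up episode counts with one extreme value
episodes <- c(1, 12, 12, 13, 24, 25, 26, 1500)

# First and third quartiles, ignoring missing values
q <- unname(quantile(episodes, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]

# Flag values more than 1.5 * IQR outside the quartiles
is_outlier <- episodes < q[1] - 1.5 * iqr | episodes > q[2] + 1.5 * iqr

episodes[is_outlier]
```

A flagged value is not necessarily an error; a long-running series can legitimately have a very high episode count.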

3.4 Cleaned Dataset

The top 100 observations of the final cleaned dataset can be found below in an interactive table.

output_data <- head(anime_final, n=100)

datatable(output_data, filter = 'top', options = list(pageLength = 25))

3.5 Summary of Variables

The final dataset that I will be using in the study has the following variables:

  1. anime_ID (Integer): Anime ID (as in https://myanimelist.net/anime/animeID)
  2. name (Character): title of the anime
  3. type (Factor): anime type (e.g. TV, Movie, OVA)
  4. source (Character): source of the anime (e.g. original, manga, game, music, visual novel etc.)
  5. producers (Character): producers
  6. genre (Factor): genre
  7. studio (Character): studio
  8. episodes (Integer): number of episodes
  9. airing_status (Factor): finished airing or currently airing
  10. end_year (Character): end year
  11. end_month (Character): end month
  12. end_day (Character): end day
  13. duration (Character): per-episode duration or entire duration, text string
  14. age_group (Factor): age rating
  15. anime_rating (Numeric): user score (higher = better)
  16. scored_by (Integer): number of users that scored the anime
  17. rank (Integer): rank, weighted according to the MyAnimeList formula
  18. popularity_score (Integer): based on how many members/users have the respective anime in their list
  19. member_count (Integer): number of members that added this anime to their list
  20. favourite_count (Integer): number of members that marked this anime as a favorite in their list
  21. premiered_season (Character): season in which the anime premiered
  22. premiered_Year (Integer): year in which the anime premiered
  23. broadcast_day (Character): day on which it is (regularly) broadcast
  24. broadcast_time (Character): time at which it is (regularly) broadcast

4. Exploratory Data Analysis

Proposed Approach

I will be analyzing the popularity, ranking, member count, favourite count and rating by genre, age category, type, premiered season, broadcast time etc. to answer the following questions:

  1. In which age group are animes most popular?
  2. Do Broadcast time and premier season have any effect on Anime popularity?
  3. Which genres are the most popular and highly rated?
  4. What type of animes are most popular?
  5. Are there any other factors which don’t appear relevant but could, in reality, decide the popularity of an Anime? For example, could the popularity be related to airing_status or studio?

To complete this analysis I will be using ggplot2 plots such as histograms and performing regression analysis.
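
As a rough illustration of the kind of analysis proposed, the sketch below computes a grouped summary, builds a histogram and fits a simple linear regression. The data frame toy and all of its values are made up; the column names follow the cleaned dataset, and the choice of a log10 transform for member_count is an assumption for illustration only.

```r
library(dplyr)
library(ggplot2)

# Made-up data using column names from the cleaned dataset
toy <- data.frame(
  type         = c("TV", "TV", "Movie", "Movie", "OVA", "OVA"),
  anime_rating = c(8.1, 7.4, 8.6, 7.9, 6.8, 7.0),
  member_count = c(500000, 120000, 300000, 90000, 15000, 22000),
  stringsAsFactors = FALSE
)

# Mean rating per anime type
rating_by_type <- toy %>%
  group_by(type) %>%
  summarise(mean_rating = mean(anime_rating, na.rm = TRUE), .groups = "drop")

# Histogram of ratings (plot object; print it to display)
rating_hist <- ggplot(toy, aes(x = anime_rating)) +
  geom_histogram(bins = 5) +
  labs(x = "Rating", y = "Number of animes")

# Simple linear regression of rating on (log-scaled) member count
fit <- lm(anime_rating ~ log10(member_count), data = toy)

rating_by_type
coef(fit)
```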