The word anime is the Japanese term for animation and, in Japan, covers all forms of animated media. Outside Japan, anime refers specifically to animation from Japan, or to a Japanese-disseminated animation style often characterized by colorful graphics, vibrant characters and fantastical themes. The anime industry consists of over 430 production studios, including major names like Studio Ghibli, Gainax, and Toei Animation. Despite comprising only a fraction of Japan’s domestic film market, anime makes up a majority of Japanese DVD sales. It has also seen international success after the rise of English-dubbed programming. This rise in international popularity has resulted in non-Japanese productions using the anime art style. Whether these works are anime-influenced animation or proper anime is a subject of debate amongst fans. As of 2016, Japanese anime accounted for 60% of the world’s animated television shows.
Problem Statement:
The objective of this project is to explore the Anime dataset and answer questions such as which factors affect the ratings and popularity of an anime.
Approach
Various factors such as age category, broadcast time, premiered season, genre and type will be analyzed to predict the popularity of an anime. As part of cleaning, I will drop the variables that look irrelevant. During the analysis, I will also drop the variables that show no correlation with the popularity, ratings or rankings of the anime. This approach lets me compare the relevant variables and derive conclusions about which factors decide the popularity of an anime. This can help producers get the maximum benefit out of new anime, be it in monetary terms or in terms of popularity, ratings and fame.
The following packages will be used in the analysis:
library(tidyverse)
library(ggplot2)
library(dplyr)
library(DT)
library(tidyr)
library(kableExtra)
library(lubridate)
The dataset used in the study can be found here: Original dataset.
The first step is to import the dataset into RStudio.
tidy_anime <- read.csv("tidy_anime.csv", stringsAsFactors = FALSE, header = TRUE)
The original dataset has the following variables:
Before I proceed with data cleaning, I need to understand the dataset: its structure, variable types, missing values, duplicate values etc. I will do that in the following steps:
I will check the dimensions of the dataset using the code below.
Observations:
dim(tidy_anime)
I will look at the different columns associated with the dataset and then drop the ones I feel are not relevant for the study.
names(tidy_anime)
Observations:
anime_relevant <- select(tidy_anime,-c(3, 4 ,5, 24, 25, 28 ))
dim(anime_relevant) # Verifying the changes
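Dropping columns by position works, but it silently breaks if the column order of the source file ever changes. As a minimal sketch of the same step done by name, the toy data frame below stands in for tidy_anime; the column names title_english and synopsis are assumptions for illustration and are not taken from the dataset description above.

```r
library(dplyr)

# Toy stand-in for tidy_anime (hypothetical column names and values)
toy_anime <- data.frame(
  animeID = c(1, 2),
  name = c("Show A", "Show B"),
  title_english = c("Show A (EN)", "Show B (EN)"),
  synopsis = c("text", "text"),
  score = c(7.5, 8.1),
  stringsAsFactors = FALSE
)

# Columns judged irrelevant for the study, listed by name
drop_cols <- c("title_english", "synopsis")

# any_of() ignores names that are not present instead of raising an error
toy_relevant <- select(toy_anime, -any_of(drop_cols))
names(toy_relevant)
```

Listing the names also documents which variables were removed, which the positional indices do not.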
Observation:
names(anime_relevant)[1]<-"anime_ID"
names(anime_relevant)[9]<-"airing_status"
names(anime_relevant)[14]<-"age_group"
names(anime_relevant)[15]<-"anime_rating"
names(anime_relevant)[18]<-"popularity_score"
names(anime_relevant)[19]<-"member_count"
names(anime_relevant)[20]<-"favourite_count"
names(anime_relevant)[21]<-"season_premiered"
names(anime_relevant)[22]<-"broadcast_timing"
colnames(anime_relevant) # Verifying that the names are changed
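The same renaming can be written against the old column names rather than their positions. This is only a sketch: the old names animeID and status are assumed here for illustration, since the original column names are not listed above.

```r
library(dplyr)

# Toy data frame with assumed original column names
toy <- data.frame(
  animeID = 1:2,
  status = c("Finished Airing", "Currently Airing"),
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs
toy_renamed <- rename(toy, anime_ID = animeID, airing_status = status)
colnames(toy_renamed) # Verifying that the names are changed
```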
I will check if all the variable types are appropriate and then change the variable type if required.
str(anime_relevant)
Observations:
anime_relevant$start_date <- as.Date(anime_relevant$start_date)
anime_relevant$end_date <- as.Date(anime_relevant$end_date)
anime_relevant$age_group <- as.factor(anime_relevant$age_group)
anime_relevant$type <- as.factor(anime_relevant$type)
anime_relevant$genre <- as.factor(anime_relevant$genre)
anime_relevant$airing_status <- as.factor(anime_relevant$airing_status)
str(anime_relevant) #Verifying that the data type is updated
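The type changes above can also be grouped into a single mutate() call. The sketch below uses a small toy data frame with made-up values and the renamed columns start_date, type and age_group.

```r
library(dplyr)

# Toy data with character columns, as read.csv(stringsAsFactors = FALSE) would give
toy <- data.frame(
  start_date = c("2019-01-05", "2018-10-02"),
  type = c("TV", "Movie"),
  age_group = c("PG-13", "R"),
  stringsAsFactors = FALSE
)

toy_typed <- toy %>%
  mutate(
    start_date = as.Date(start_date),        # ISO dates parse without a format string
    across(c(type, age_group), as.factor)    # convert several columns at once
  )
str(toy_typed)
```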
The columns start_date, end_date, season_premiered and broadcast_timing can each be split into their components for ease of analysis.
Observations:
anime_separated <- anime_relevant %>%
  separate(start_date, c("start_year", "start_month", "start_day")) %>%
  separate(end_date, c("end_year", "end_month", "end_day")) %>%
  separate(season_premiered, c("premiered_season", "premiered_Year")) %>%
  separate(broadcast_timing, c("broadcast_day", "blank1", "broadcast_time", "blank2"), sep = " ") %>%
  select(-c(blank1, blank2))
colnames(anime_separated) # Verifying that the columns are separated
dim(anime_separated) #Verifying number of columns
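To see what the split does to individual values, here is a small sketch on toy data. The broadcast string format "Saturdays at 01:00 (JST)" is an assumption inferred from the four pieces requested above, and fill = "right" is added so that short values such as "Unknown" are padded with NA without a warning.

```r
library(dplyr)
library(tidyr)

# Toy values; the broadcast format is assumed, not confirmed by the dataset
toy <- data.frame(
  broadcast_timing = c("Saturdays at 01:00 (JST)", "Unknown"),
  season_premiered = c("Spring 1998", NA),
  stringsAsFactors = FALSE
)

toy_sep <- toy %>%
  separate(season_premiered, c("premiered_season", "premiered_Year"), sep = " ") %>%
  separate(broadcast_timing,
           c("broadcast_day", "blank1", "broadcast_time", "blank2"),
           sep = " ", fill = "right") %>%
  select(-c(blank1, blank2)) # "at" and "(JST)" carry no information
toy_sep
```

Note that separate() returns character columns, so year and month fields still need a numeric conversion before any arithmetic or comparison.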
Next I will check if there are any duplicate rows and remove them accordingly.
Observations:
anime_clean <- unique(anime_separated)
dim(anime_clean)
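Before removing duplicates it can be useful to count them, so that the change in row count is explained. A minimal base R sketch on toy data:

```r
# Toy data with one fully duplicated row
toy <- data.frame(
  anime_ID = c(1, 1, 2),
  genre = c("Action", "Action", "Drama"),
  stringsAsFactors = FALSE
)

n_dupes <- sum(duplicated(toy)) # rows identical to an earlier row
toy_unique <- unique(toy)       # same call as used above

n_dupes
dim(toy_unique)
```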
I will check the number of missing values for each column and then decide how to deal with missing values if any.
Observations:
colSums(is.na(anime_clean))
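Raw counts are easier to judge next to the share of rows they represent. As a sketch on toy data, colMeans() over the same logical matrix gives the fraction of missing values per column:

```r
# Toy data with missing values in both columns
toy <- data.frame(
  start_month = c(4, NA, 10, 1),
  premiered_season = c("Spring", NA, NA, "Winter"),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column, largest first
na_share <- round(colMeans(is.na(toy)) * 100, 1)
sort(na_share, decreasing = TRUE)
```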
Before I impute the missing values, I need to change the data type for ‘start_month’ and ‘start_year’ to numeric. After changing the variable type, I will proceed with the imputation of missing values.
Observations:
anime_clean$start_month <- as.integer(anime_clean$start_month) #changing data type for start month
anime_clean$start_year <- as.integer(anime_clean$start_year) #changing data type for start year
anime_clean$premiered_season <- ifelse(anime_clean$start_month %in% c(3, 4, 5), "Spring",
  ifelse(anime_clean$start_month %in% c(6, 7, 8), "Summer",
    ifelse(anime_clean$start_month %in% c(9, 10, 11), "Fall",
      ifelse(anime_clean$start_month %in% c(12, 1, 2), "Winter",
        no = NA
      )))) # Imputing missing values for premiered season
anime_clean$premiered_Year <- anime_clean$start_year # Imputing missing values for premiered year
colSums(is.na(anime_clean)) #Verifying the changes
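The nested ifelse() above recomputes the season for every row. If the intent is to keep the recorded season and fill in only the missing ones, one possible variant is sketched below on toy data, using case_when() with the same month-to-season mapping and coalesce() to fill the gaps. This is an alternative under that assumption, not the method used in this analysis.

```r
library(dplyr)

# Toy data: only the first row has a recorded season
toy <- data.frame(
  start_month = c(4L, 7L, 12L, NA),
  premiered_season = c("Spring", NA, NA, NA),
  stringsAsFactors = FALSE
)

toy_filled <- toy %>%
  mutate(
    season_from_month = case_when(
      start_month %in% c(3, 4, 5) ~ "Spring",
      start_month %in% c(6, 7, 8) ~ "Summer",
      start_month %in% c(9, 10, 11) ~ "Fall",
      start_month %in% c(12, 1, 2) ~ "Winter",
      TRUE ~ NA_character_ # no start month, so no season can be derived
    ),
    # keep the recorded season, fall back to the derived one
    premiered_season = coalesce(premiered_season, season_from_month)
  ) %>%
  select(-season_from_month)
toy_filled
```

Rows with neither a recorded season nor a start month stay NA, which matches the behaviour of the nested ifelse() for those rows.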
Since start date and premiered season indicate the same meaning, I will drop one of these variables. I will remove start day, month and year and will keep premiered season and year. I will also drop the column ‘airing’ as it is the same as airing_status.
Observations:
anime_final <- select(anime_clean,-c( 10, 11, 12, 13))
colnames(anime_final)
dim(anime_final)
Next I will check the unique values in each column to determine if there are any values that need to be changed for more uniformity in the data.
Observations:
unique(anime_final$anime_ID)
unique(anime_final$name)
unique(anime_final$type)
unique(anime_final$genre)
unique(anime_final$source)
unique(anime_final$producers)
unique(anime_final$episodes)
unique(anime_final$airing_status)
unique(anime_final$end_year)
unique(anime_final$end_month)
unique(anime_final$end_day)
unique(anime_final$duration)
unique(anime_final$anime_rating)
unique(anime_final$premiered_season)
unique(anime_final$premiered_Year)
unique(anime_final$broadcast_day)
unique(anime_final$broadcast_time)
anime_final$broadcast_time[anime_final$broadcast_time == "Unknown"] <- NA
anime_final$type[anime_final$type == "Unknown"] <- NA
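Because type was converted to a factor earlier, setting its "Unknown" entries to NA leaves "Unknown" behind as an empty level, which can still show up in summaries and plot legends. A minimal sketch on a toy factor shows the effect and how droplevels() removes it:

```r
# Toy factor standing in for anime_final$type
type <- factor(c("TV", "Unknown", "Movie"))

type[type == "Unknown"] <- NA
levels(type) # "Unknown" is still listed as a level

type <- droplevels(type) # remove levels that no longer occur
levels(type)
```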
Observation:
summary(anime_final)
The top 100 observations of the final cleaned dataset can be found below in an interactive table.
output_data <- head(anime_final, n=100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))
The final dataset that I will be using in the study has the following variables:
Proposed Approach
I will be analyzing the popularity, ranking, member count, favourite count and rating by genre, age category, type, premiered season, broadcast time etc. to answer the following questions:
To complete this analysis I will be using ggplot2 plots such as histograms and performing regression analysis.
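As a rough sketch of what the regression step could look like, the example below fits a linear model of rating on type and member count. The data are simulated purely for illustration and say nothing about the real dataset; the choice of predictors and the log10 transform are assumptions, not results of this study.

```r
set.seed(1)

# Simulated toy data (not the anime dataset)
toy <- data.frame(
  type = factor(rep(c("TV", "Movie", "OVA"), each = 10)),
  member_count = round(runif(30, min = 100, max = 10000))
)
toy$anime_rating <- 5 + 0.0002 * toy$member_count + rnorm(30, sd = 0.5)

# Linear model: rating explained by type and (log) member count
fit <- lm(anime_rating ~ type + log10(member_count), data = toy)
summary(fit)
```

On the real data, the same call would use anime_final with the renamed columns, after removing rows with missing ratings.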