MyAnimeList Exploratory Data Analysis and Modelling

MyAnimeList Exploratory Data Analysis

This project explores the MyAnimeList dataset provided on Kaggle.com. The .csv file, the remaining plots, and their code can be found on GitHub.

Importing the data

To start, we import the data into R and inspect its structure:

library(data.table)  # provides fread()

anime_data <- fread("dataanime.csv")
str(anime_data)

## Classes 'data.table' and 'data.frame':   1563 obs. of  20 variables:
##  $ Title          : chr  "Fullmetal Alchemist: Brotherhood" "Kimi no Na wa." "GintamaÂ°" "Steins;Gate 0" ...
##  $ Type           : chr  "TV" "Movie" "TV" "TV" ...
##  $ Episodes       : chr  "64" "1" "51" "23" ...
##  $ Status         : chr  "Finished Airing" "Finished Airing" "Finished Airing" "Currently Airing" ...
##  $ Start airing   : chr  "2009-4-5" "2016-8-26" "2015-4-8" "2018-4-12" ...
##  $ End airing     : chr  "2010-7-4" "-" "2016-3-30" "-" ...
##  $ Starting season: chr  "Spring" "-" "Spring" "Spring" ...
##  $ Broadcast time : chr  "Sundays at 17:00 (JST)" "-" "Wednesdays at 18:00 (JST)" "Thursdays at 01:35 (JST)" ...
##  $ Producers      : chr  "Aniplex,Square Enix,Mainichi Broadcasting System,Studio Moriken" "Kadokawa Shoten,Toho,Sound Team Don Juan,Lawson HMV Entertainment,Amuse,East Japan Marketing & Communications" "TV Tokyo,Aniplex,Dentsu" "Nitroplus" ...
##  $ Licensors      : chr  "Funimation,Aniplex of America" "Funimation,NYAV Post" "Funimation,Crunchyroll" "Funimation" ...
##  $ Studios        : chr  "Bones" "CoMix Wave Films" "Bandai Namco Pictures" "White Fox" ...
##  $ Sources        : chr  "Manga" "Original" "Manga" "Visual novel" ...
##  $ Genres         : chr  "Action,Military,Adventure,Comedy,Drama,Magic,Fantasy,Shounen" "Supernatural,Drama,Romance,School" "Action,Comedy,Historical,Parody,Samurai,Sci-Fi,Shounen" "Sci-Fi,Thriller" ...
##  $ Duration       : chr  "24 min. per ep." "1 hr. 46 min." "24 min. per ep." "23 min. per ep." ...
##  $ Rating         : chr  "R" "PG-13" "R" "PG-13" ...
##  $ Score          : num  9.25 9.19 9.16 9.16 9.14 9.11 9.11 9.11 9.1 9.07 ...
##  $ Scored by      : int  719706 454969 70279 12609 552791 28452 90758 395162 26284 62582 ...
##  $ Members        : int  1176368 705186 194359 186331 990419 121772 212238 705225 80166 121612 ...
##  $ Favorites      : int  105387 33936 5597 1117 90365 8370 4533 63324 1961 1498 ...
##  $ Description    : chr  "\"\"In order for something to be obtained, something of equal value must be lost.\"\"\r\n\r\nAlchemy is bound b"| __truncated__ "Mitsuha Miyamizu, a high school girl, yearns to live the life of a boy in the bustling city of Tokyoâ\200”a dre"| __truncated__ "Gintoki, Shinpachi, and Kagura return as the fun-loving but broke members of the Yorozuya team! Living in an al"| __truncated__ "The dark untold story of Steins;Gate that leads with the eccentric mad scientist Okabe, struggling to recover f"| __truncated__ ...
##  - attr(*, ".internal.selfref")=<externalptr>

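We can see that we have a data table with 1563 anime and 20 variables. Each anime comes with information such as its Title, Score, Sources, and Broadcast time. Several columns that look numeric or date-like, such as Episodes and Start airing, are stored as character strings, and missing values appear as "-".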

Data Cleaning

First, we change the categorical columns into factors:
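The conversion code itself is not shown here. A minimal sketch of what it could look like, assuming the categorical columns are Type, Status, Starting season, Sources and Rating (this selection is an assumption) and using a small stand-in table named `toy_anime` in place of the real data:

```r
# Toy stand-in for anime_data; the real table comes from dataanime.csv
toy_anime <- data.frame(
    Type = c("TV", "Movie", "TV"),
    Status = c("Finished Airing", "Finished Airing", "Currently Airing"),
    `Starting season` = c("Spring", "-", "Spring"),
    Sources = c("Manga", "Original", "Visual novel"),
    Rating = c("R", "PG-13", "PG-13"),
    check.names = FALSE,
    stringsAsFactors = FALSE
)

# Assumed set of categorical columns (not confirmed by the original code)
factor_cols <- c("Type", "Status", "Starting season", "Sources", "Rating")

# Convert each selected column from character to factor
for (col in factor_cols) {
    toy_anime[[col]] <- as.factor(toy_anime[[col]])
}

str(toy_anime)
```

Note that in this sketch the "-" placeholder becomes a factor level of its own, so it may be worth recoding it to NA before converting.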

Next, we convert the date columns into the proper date format in R:

anime_data$`Start airing` <-
    as.Date(anime_data$`Start airing`, "%Y-%m-%d")
anime_data$`End airing` <-
    as.Date(anime_data$`End airing`, "%Y-%m-%d")

Now we plot the percentage of missing (NA) values in each column:
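The plotting code is not included here. As a rough sketch of the underlying calculation, assuming that the "-" placeholder seen in the `str()` output should count as missing in text columns, the share of missing values per column could be computed as follows (`toy_anime` and `is_missing` are illustrative names):

```r
# Toy stand-in for anime_data with the "-" placeholder used in the raw file
toy_anime <- data.frame(
    Title = c("A", "B", "C", "D"),
    `End airing` = c("2010-7-4", "-", "2016-3-30", "-"),
    `Starting season` = c("Spring", "-", "Spring", "Spring"),
    Score = c(9.25, 9.19, 9.16, 9.16),
    check.names = FALSE,
    stringsAsFactors = FALSE
)

# A value counts as missing if it is NA or, for text columns, equal to "-"
is_missing <- function(col) {
    if (is.character(col) || is.factor(col)) {
        is.na(col) | as.character(col) == "-"
    } else {
        is.na(col)
    }
}

# Percentage of missing values per column, largest first
missing_pct <- sapply(toy_anime, function(col) mean(is_missing(col)) * 100)
missing_pct <- sort(missing_pct, decreasing = TRUE)
print(missing_pct)
```

The resulting named vector can be passed to `barplot()` to obtain a chart of the kind described above.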

Exploratory Data Analysis

In this part we ask a series of questions about the data and try to answer each of them with a plot:

Which type of anime is most common?

In this plot we can clearly see that most of the anime in the dataset are TV series, followed by movies, and then OVAs and Specials.

Which type of anime is more popular: series or movies?

Here we see that the difference between the mean scores of anime series and anime movies is very small, so we cannot conclude that whether an anime is a series or a movie has a meaningful relationship with its score.
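One way to back up a statement like this is a two-sample test on the scores. A minimal sketch using Welch's t-test from base R on made-up scores (the numbers below are illustrative and not taken from the dataset, and this test is not part of the original analysis):

```r
# Made-up scores for illustration only; not taken from the dataset
toy_scores <- data.frame(
    Type = c("TV", "TV", "TV", "TV", "Movie", "Movie", "Movie", "Movie"),
    Score = c(8.1, 7.9, 8.4, 8.0, 8.2, 7.8, 8.3, 8.2)
)

# Mean score per type
type_means <- tapply(toy_scores$Score, toy_scores$Type, mean)
print(type_means)

# Welch two-sample t-test: can the gap between the means be told apart from noise?
type_test <- t.test(Score ~ Type, data = toy_scores)
print(type_test$p.value)
```

A large p-value is consistent with a gap that is too small to separate from noise. On the real data, the same call would be applied to the TV and Movie rows only.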

Which status of anime series is more popular?

Currently airing series have higher scores, but only by a very thin margin, which again does not allow us to conclude that there is a meaningful difference between the scores of currently airing and finished anime series.

Which years were the best for starting an anime series?

Here we see that, apart from a few outliers in the years before 2000 where we have less data, the general trend is upward, and by this measure anime has been getting more popular year by year.
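The yearly averages behind a plot like this can be sketched as follows. The scores are made up, and `toy_anime`, `start_year` and `yearly` are illustrative names. Keeping the number of observations next to each mean makes it easier to see which years rest on very little data:

```r
# Toy stand-in with made-up scores; dates use the same format as the raw file
toy_anime <- data.frame(
    `Start airing` = as.Date(
        c("1998-4-3", "1998-10-1", "2009-4-5", "2015-4-8", "2015-7-1"),
        "%Y-%m-%d"
    ),
    Score = c(8.0, 8.2, 8.4, 8.6, 8.8),
    check.names = FALSE
)

# Extract the starting year from the date column
toy_anime$start_year <- as.integer(format(toy_anime$`Start airing`, "%Y"))

# Mean score and number of anime per starting year
mean_score <- tapply(toy_anime$Score, toy_anime$start_year, mean)
n_anime <- tapply(toy_anime$Score, toy_anime$start_year, length)

yearly <- data.frame(
    start_year = as.integer(names(mean_score)),
    mean_score = as.vector(mean_score),
    n = as.vector(n_anime)
)
print(yearly)
```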

Which years were the best for ending an anime series?

Similar to the previous part, we see that on average newer anime have a higher score than older anime.

Which year was the best for releasing an anime movie?

We can see the same trend of rising scores for anime movies as for anime series. There are a lot more ups and downs because each year has fewer anime movies than anime series, but the general trend is still upward.

Which season is most filled with anime?

As shown in the plot, Spring and Fall have a lot more anime than Winter and Summer. This could have many different causes related to Japanese TV schedules, media culture, the school year, etc.

Which season is best for starting an anime series?

Here we see that although the number of running anime changes between seasons, the score they receive does not fluctuate that much: the average score for Summer anime is only 0.05 lower than in the other seasons. However, as anime are also watched on DVD and streaming services later on, those viewers could affect the score too.

Which day of the week is the most anime heavy?

The data here shows us what we would expect of it. There are a lot more anime airing over the weekend than on any other day of the week.
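The weekday and the time of day are not separate columns in the raw data; they are embedded in `Broadcast time` strings such as "Sundays at 17:00 (JST)". A minimal sketch of one way to split them out, assuming every usable entry follows that pattern (the regular expression is illustrative and not the original preparation code; the names `weekday` and `day_time` simply mirror the predictors used in the model later on):

```r
# Example values copied from the str() output above
broadcast <- c(
    "Sundays at 17:00 (JST)",
    "-",
    "Wednesdays at 18:00 (JST)",
    "Thursdays at 01:35 (JST)"
)

# Capture the day name (without the plural "s") and the HH:MM time
pattern <- "^([A-Za-z]+)s at ([0-9]{2}:[0-9]{2}) \\(JST\\)$"
matched <- grepl(pattern, broadcast)

# Entries that do not match the pattern (e.g. "-") become NA
weekday <- ifelse(matched, sub(pattern, "\\1", broadcast), NA_character_)
day_time <- ifelse(matched, sub(pattern, "\\2", broadcast), NA_character_)

# Number of anime per broadcast day
print(table(weekday))
```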

Which day of the week is better for anime broadcast time?

As seen in the plot, there is no meaningful correlation between the airing day of an anime and its final score.

Which time slot has the most anime?

There seems to be a lot of variety in the time slots of anime, which means that at almost any hour you could turn to a Japanese TV channel and find some anime to watch!

Which time slot is the best for anime?

The variation in scores across time slots suggests that the time slot might be a good predictor of an anime's score.

What percentage of anime come from a manga?

Most of the anime aired on Japanese TV share the same source, namely manga from one of the various weekly manga publications. After manga, there are a lot of original anime series, followed by anime based on light novels and novels.

How does the source affect the popularity of an anime?

The highest average scores per source belong to visual novels and web manga, respectively. However, this could be due to the limited number of observations biasing the data, or because visual novel fans are much more dedicated to their franchises; we cannot state anything more than that.

How does the rating affect popularity?

The data shows that there is not much of a difference in scores between the different ratings. Anime rated R seem to have a slight advantage over the other ratings, but as we have a lot of missing values, we cannot state anything more.

Modelling

Below we fit a multiple linear regression model of the anime series score on factors such as the number of episodes, starting season, source, rating, etc.

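# anime_series_data: presumably the TV-series subset of anime_data, with
# weekday and day_time derived from `Broadcast time` (preparation not shown)
# Non-numeric episode counts become NA and are dropped by na.omit()
anime_series_data$Episodes <- as.numeric(anime_series_data$Episodes)
anime_series_data <- na.omit(anime_series_data)
score_fit <-
    lm(Score ~ Episodes + `Starting season` + Sources + Rating + weekday +
        day_time + Favorites + Members + `Scored by`,
        data = anime_series_data)
summary(score_fit)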

## 
## Call:
## lm(formula = Score ~ Episodes + `Starting season` + Sources + 
##     Rating + weekday + day_time + Favorites + Members + `Scored by`, 
##     data = anime_series_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69642 -0.19219 -0.00105  0.18152  1.05289 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.808e+00  2.256e-01  34.607  < 2e-16 ***
## Episodes                -1.188e-03  4.629e-04  -2.567 0.010623 *  
## `Starting season`Spring  2.743e-02  4.165e-02   0.659 0.510594    
## `Starting season`Summer -9.543e-02  5.016e-02  -1.903 0.057840 .  
## `Starting season`Winter  4.635e-02  4.811e-02   0.963 0.335999    
## Sources4-koma manga     -7.777e-02  2.236e-01  -0.348 0.728156    
## SourcesBook              1.522e-01  3.078e-01   0.494 0.621361    
## SourcesGame             -1.691e-01  2.372e-01  -0.713 0.476439    
## SourcesLight novel       8.928e-02  2.112e-01   0.423 0.672709    
## SourcesManga             1.861e-01  2.043e-01   0.911 0.362934    
## SourcesMusic            -2.743e-01  3.933e-01  -0.697 0.486068    
## SourcesNovel             7.926e-02  2.062e-01   0.384 0.700848    
## SourcesOriginal          1.242e-02  2.059e-01   0.060 0.951919    
## SourcesOther             8.318e-02  2.709e-01   0.307 0.758947    
## SourcesVisual novel      1.151e-01  2.322e-01   0.496 0.620445    
## SourcesWeb manga         3.055e-01  2.342e-01   1.305 0.192819    
## RatingPG                -3.687e-02  1.430e-01  -0.258 0.796650    
## RatingPG-13              5.244e-02  8.167e-02   0.642 0.521234    
## RatingR                  7.648e-02  9.004e-02   0.849 0.396170    
## weekdayMonday           -2.996e-03  7.569e-02  -0.040 0.968444    
## weekdaySaturday          6.095e-02  6.522e-02   0.935 0.350588    
## weekdaySunday            6.637e-02  6.668e-02   0.995 0.320174    
## weekdayThursday          3.897e-02  7.290e-02   0.535 0.593288    
## weekdayTuesday           2.579e-02  6.615e-02   0.390 0.696871    
## weekdayWednesday         8.204e-02  7.716e-02   1.063 0.288340    
## day_time00:10           -2.410e-01  3.373e-01  -0.715 0.475271    
## day_time00:30           -8.018e-02  9.684e-02  -0.828 0.408224    
## day_time00:35           -1.420e-01  2.473e-01  -0.574 0.566317    
## day_time00:45            7.470e-02  1.163e-01   0.642 0.521000    
## day_time00:50            8.397e-02  1.443e-01   0.582 0.560890    
## day_time00:55           -1.793e-01  1.674e-01  -1.071 0.284671    
## day_time00:56           -4.813e-01  3.771e-01  -1.276 0.202578    
## day_time00:59           -1.098e-01  2.496e-01  -0.440 0.660230    
## day_time01:00           -1.039e-01  1.161e-01  -0.895 0.371302    
## day_time01:05           -3.812e-02  1.085e-01  -0.351 0.725478    
## day_time01:10            1.491e-01  3.418e-01   0.436 0.662995    
## day_time01:15           -1.305e-01  1.270e-01  -1.028 0.304751    
## day_time01:20            6.293e-02  2.492e-01   0.253 0.800736    
## day_time01:23           -2.806e-01  2.466e-01  -1.138 0.255841    
## day_time01:25           -4.364e-02  1.287e-01  -0.339 0.734708    
## day_time01:28            2.144e-01  3.401e-01   0.630 0.528829    
## day_time01:29            1.001e-01  1.538e-01   0.651 0.515733    
## day_time01:30           -2.388e-01  9.313e-02  -2.565 0.010701 *  
## day_time01:35            1.904e-01  1.095e-01   1.739 0.082790 .  
## day_time01:45           -5.310e-01  3.404e-01  -1.560 0.119563    
## day_time01:50           -2.795e-01  3.369e-01  -0.830 0.407315    
## day_time01:55           -1.891e-01  1.125e-01  -1.681 0.093536 .  
## day_time01:58           -1.306e-02  1.529e-01  -0.085 0.932004    
## day_time01:59            5.301e-02  2.559e-01   0.207 0.836020    
## day_time02:00           -2.950e-02  1.639e-01  -0.180 0.857286    
## day_time02:05           -1.781e-01  1.405e-01  -1.268 0.205504    
## day_time02:08            2.722e-01  3.363e-01   0.810 0.418706    
## day_time02:10           -1.608e-01  2.039e-01  -0.789 0.430748    
## day_time02:12            1.189e-01  3.436e-01   0.346 0.729586    
## day_time02:13           -3.570e-01  3.486e-01  -1.024 0.306340    
## day_time02:15           -3.894e-02  3.544e-01  -0.110 0.912578    
## day_time02:16           -3.404e-01  3.403e-01  -1.000 0.317836    
## day_time02:19            3.098e-01  3.391e-01   0.914 0.361501    
## day_time02:20           -4.283e-02  2.037e-01  -0.210 0.833571    
## day_time02:21            1.064e-01  3.386e-01   0.314 0.753630    
## day_time02:25           -1.236e-01  1.401e-01  -0.882 0.378323    
## day_time02:28           -3.877e-01  1.848e-01  -2.098 0.036512 *  
## day_time02:30            3.641e-01  3.544e-01   1.027 0.304834    
## day_time02:35           -2.370e-01  2.016e-01  -1.175 0.240516    
## day_time02:40           -3.381e-01  3.553e-01  -0.952 0.341918    
## day_time02:55           -6.029e-01  3.376e-01  -1.786 0.074940 .  
## day_time02:58           -1.155e-01  2.027e-01  -0.570 0.569252    
## day_time03:08           -1.274e-01  3.386e-01  -0.376 0.706918    
## day_time03:10           -3.489e-01  3.398e-01  -1.027 0.305194    
## day_time03:40            1.070e-01  3.446e-01   0.310 0.756420    
## day_time06:30           -4.792e-01  3.385e-01  -1.416 0.157604    
## day_time07:00           -2.401e-02  2.062e-01  -0.116 0.907361    
## day_time07:30           -7.582e-02  3.389e-01  -0.224 0.823097    
## day_time08:05            1.425e-01  3.385e-01   0.421 0.674093    
## day_time08:06            1.069e-01  3.435e-01   0.311 0.755857    
## day_time08:30           -1.036e-01  1.706e-01  -0.608 0.543735    
## day_time09:00           -2.197e-01  1.492e-01  -1.472 0.141747    
## day_time09:20            7.109e-01  3.910e-01   1.818 0.069803 .  
## day_time09:30            9.093e-02  3.405e-01   0.267 0.789603    
## day_time10:00            6.444e-03  2.959e-01   0.022 0.982638    
## day_time10:30           -1.247e-01  1.900e-01  -0.656 0.512274    
## day_time10:55            3.869e-02  3.592e-01   0.108 0.914278    
## day_time12:00           -1.672e-01  3.626e-01  -0.461 0.644958    
## day_time17:00           -9.917e-02  1.108e-01  -0.895 0.371193    
## day_time17:30           -4.841e-02  1.296e-01  -0.374 0.708868    
## day_time17:55            2.473e-01  3.766e-01   0.657 0.511775    
## day_time18:00           -4.523e-03  9.250e-02  -0.049 0.961026    
## day_time18:25           -2.002e-01  2.526e-01  -0.793 0.428488    
## day_time18:30           -2.493e-01  1.114e-01  -2.238 0.025819 *  
## day_time18:55            2.043e-01  3.713e-01   0.550 0.582503    
## day_time19:00           -9.376e-02  1.081e-01  -0.868 0.386195    
## day_time19:30           -1.651e-01  1.072e-01  -1.540 0.124470    
## day_time20:00           -4.607e-01  3.411e-01  -1.351 0.177588    
## day_time20:30            6.263e-01  3.400e-01   1.842 0.066192 .  
## day_time21:00           -3.661e-01  2.442e-01  -1.499 0.134599    
## day_time21:30            6.164e-02  1.455e-01   0.424 0.671997    
## day_time22:00           -2.521e-01  1.048e-01  -2.405 0.016647 *  
## day_time22:30           -2.400e-01  1.147e-01  -2.091 0.037144 *  
## day_time22:55           -4.056e-01  3.383e-01  -1.199 0.231287    
## day_time23:00           -7.814e-02  1.072e-01  -0.729 0.466297    
## day_time23:15            1.292e-01  3.452e-01   0.374 0.708391    
## day_time23:17           -1.121e-01  3.483e-01  -0.322 0.747768    
## day_time23:30           -2.565e-01  1.022e-01  -2.509 0.012528 *  
## day_time23:45            4.942e-01  3.386e-01   1.459 0.145255    
## Favorites                1.766e-05  2.900e-06   6.088 2.74e-09 ***
## Members                  1.615e-06  4.863e-07   3.320 0.000985 ***
## `Scored by`             -2.840e-06  7.646e-07  -3.714 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.326 on 389 degrees of freedom
## Multiple R-squared:  0.4507, Adjusted R-squared:  0.301 
## F-statistic: 3.011 on 106 and 389 DF,  p-value: 3.266e-15

As the summary shows, our R-squared is 0.45 (adjusted: 0.30). A value below 0.5 could be alarming when modelling precise physical processes, but people's taste in media and how they subjectively view TV series are far less predictable than phenomena governed by precise rules, so we consider the model acceptable. The predictors in the regression were chosen based on our exploratory analysis, which suggested which of them were important to the final score.
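The gap between the multiple and the adjusted R-squared comes from the large number of coefficients, most of them `day_time` levels. As a worked check, the adjusted value can be reproduced from the numbers printed in the summary (the variable names below are illustrative):

```r
# Values read off the summary output above
r_squared <- 0.4507   # Multiple R-squared
df_resid <- 389       # residual degrees of freedom
n_pred <- 106         # number of predictors (numerator df of the F-statistic)

# Observations used in the fit: residual df + predictors + intercept
n_obs <- df_resid + n_pred + 1

# Adjusted R-squared penalises the fit for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid
print(round(adj_r_squared, 3))
```

With 106 predictors on 496 observations the penalty is substantial, which is why the adjusted figure is the more cautious summary of how well the model explains the scores.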

Future Projects

There are many more insights hidden in this dataset that could lead to new findings about the world of anime. Examples include studying the relationship between genre and rating as well as between genre and score, or predicting the rating from the description using sentiment analysis. There is also some value in establishing which studios were the most successful compared to the others.


Hossein FaridNasr

8/14/2021