For my final project, I focused on manga: Japanese comic books or graphic novels. Every year, many manga come out of Japan, either brand-new or continuing from prior years. While some manga succeed or keep succeeding, many others are cancelled or put on hiatus early in their run. This could be due to poor writing or story; however, it can also be caused by factors out of the authors' hands. Using data, my goal is to find out which variables make one manga sell better than another and to see whether I can predict how successful a new manga will be.
My motivation for this project is that I am a big manga reader, and while I love and read some of the most popular titles, I prefer to find up-and-coming or new manga. However, this has led to many disappointments when series get cancelled well before the story can unfold. With this analysis, I hope to be able to predict the success of a manga, or the likelihood of it being cancelled prematurely, so I can pick manga to read and get invested in.
I will be using three datasets from two sources. The first source is Kaggle (https://www.kaggle.com/datasets/andreuvallhernndez/myanimelist), published by Andreu Vall Hernàndez. It contains two datasets of anime and manga data from MyAnimeList, a website known in the Western manga/anime community as the biggest database for anime and manga and a place where many people go to rate them. The manga dataset has 64,833 rows, and I mostly used it to get more specific information on each manga, e.g. genres, themes, start_date. The second source is also from Kaggle (https://www.kaggle.com/datasets/drahulsingh/best-selling-manga), by D Rahulsingh. This dataset holds the best-selling manga of all time (187 series) and contains sales numbers.
library(tidyverse)  # dplyr, stringr and ggplot2 functions are used throughout
myanimelist_manga <- read.csv("myanimelist/manga.csv")
head(myanimelist_manga, 1)
## manga_id title type score scored_by status volumes chapters
## 1 2 Berserk manga 9.47 319696 currently_publishing NA NA
## start_date end_date members favorites sfw approved created_at_before
## 1 1989-08-25 643969 119470 True True 2007-07-17 20:14:45+00:00
## updated_at real_start_date real_end_date
## 1 2023-04-01 00:19:31+00:00 1989-08-25
## genres
## 1 ['Action', 'Adventure', 'Award Winning', 'Drama', 'Fantasy', 'Horror', 'Supernatural']
## themes demographics
## 1 ['Gore', 'Military', 'Mythology', 'Psychological'] ['Seinen']
## authors
## 1 [{'id': 1868, 'first_name': 'Kentarou', 'last_name': 'Miura', 'role': 'Story & Art'}, {'id': 49592, 'first_name': '', 'last_name': 'Studio Gaga', 'role': 'Art'}]
## serializations
## 1 ['Young Animal']
## synopsis
## 1 Guts, a former mercenary now known as the "Black Swordsman," is out for revenge. After a tumultuous childhood, he finally finds someone he respects and believes he can trust, only to have everything fall apart when this person takes away everything important to Guts for the purpose of fulfilling his own desires. Now marked for death, Guts becomes condemned to a fate in which he is relentlessly pursued by demonic beings.\n\nSetting out on a dreadful quest riddled with misfortune, Guts, armed with a massive sword and monstrous strength, will let nothing stop him, not even death itself, until he is finally able to take the head of the one who stripped him—and his loved one—of their humanity.\n\n[Written by MAL Rewrite]\n\nIncluded one-shot:\nVolume 14: Berserk: The Prototype
## background
## 1 Berserk won the Award for Excellence at the sixth installment of Tezuka Osamu Cultural Prize in 2002. The series has over 50 million copies in print worldwide and has been published in English by Dark Horse since November 4, 2003. It is also published in Italy, Germany, Spain, France, Brazil, South Korea, Hong Kong, Taiwan, Thailand, Poland, México and Turkey. In May 2021, the author Kentaro Miura suddenly died at the age of 54. Chapter 364 of Berserk was published posthumously on September 10, 2021. Miura would often share details about the series' story with his childhood friend and fellow mangaka Kouji Mori. Berserk resumed on June 24, 2022, with Studio Gaga handling the art and Kouji Mori's supervision.
## main_picture
## 1 https://cdn.myanimelist.net/images/manga/1/157897l.jpg
## url title_english title_japanese
## 1 https://myanimelist.net/manga/2/Berserk Berserk ベルセルク
## title_synonyms
## 1 ['Berserk: The Prototype']
colnames(myanimelist_manga)
## [1] "manga_id" "title" "type"
## [4] "score" "scored_by" "status"
## [7] "volumes" "chapters" "start_date"
## [10] "end_date" "members" "favorites"
## [13] "sfw" "approved" "created_at_before"
## [16] "updated_at" "real_start_date" "real_end_date"
## [19] "genres" "themes" "demographics"
## [22] "authors" "serializations" "synopsis"
## [25] "background" "main_picture" "url"
## [28] "title_english" "title_japanese" "title_synonyms"
nrow(myanimelist_manga)
## [1] 64833
myanimelist_anime <- read.csv("myanimelist/anime.csv")
head(myanimelist_anime, 1)
## anime_id title type score scored_by
## 1 5114 Fullmetal Alchemist: Brotherhood tv 9.1 2037075
## status episodes start_date end_date source members favorites
## 1 finished_airing 64 2009-04-05 2010-07-04 manga 3206028 219036
## episode_duration total_duration rating sfw approved
## 1 0 days 00:24:20 1 days 01:57:20 r True True
## created_at updated_at start_year start_season
## 1 2008-08-21 03:35:22+00:00 2023-04-02 18:07:03+00:00 2009 spring
## real_start_date real_end_date broadcast_day broadcast_time
## 1 2009-04-05 2010-07-04 sunday 17:00:00
## genres themes demographics
## 1 ['Action', 'Adventure', 'Drama', 'Fantasy'] ['Military'] ['Shounen']
## studios
## 1 ['Bones']
## producers
## 1 ['Aniplex', 'Square Enix', 'Mainichi Broadcasting System', 'Studio Moriken']
## licensors
## 1 ['Funimation', 'Aniplex of America']
## synopsis
## 1 After a horrific alchemy experiment goes wrong in the Elric household, brothers Edward and Alphonse are left in a catastrophic new reality. Ignoring the alchemical principle banning human transmutation, the boys attempted to bring their recently deceased mother back to life. Instead, they suffered brutal personal loss: Alphonse's body disintegrated while Edward lost a leg and then sacrificed an arm to keep Alphonse's soul in the physical realm by binding it to a hulking suit of armor.\n\nThe brothers are rescued by their neighbor Pinako Rockbell and her granddaughter Winry. Known as a bio-mechanical engineering prodigy, Winry creates prosthetic limbs for Edward by utilizing "automail," a tough, versatile metal used in robots and combat armor. After years of training, the Elric brothers set off on a quest to restore their bodies by locating the Philosopher's Stone—a powerful gem that allows an alchemist to defy the traditional laws of Equivalent Exchange.\n\nAs Edward becomes an infamous alchemist and gains the nickname "Fullmetal," the boys' journey embroils them in a growing conspiracy that threatens the fate of the world.\n\n[Written by MAL Rewrite]
## background main_picture
## 1 https://cdn.myanimelist.net/images/anime/1208/94745l.jpg
## url
## 1 https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood
## trailer_url title_english
## 1 https://www.youtube.com/watch?v=--IcmZkvL0Q Fullmetal Alchemist: Brotherhood
## title_japanese
## 1 鋼の錬金術師 FULLMETAL ALCHEMIST
## title_synonyms
## 1 ['Hagane no Renkinjutsushi: Fullmetal Alchemist', 'Fullmetal Alchemist (2009)', 'FMA', 'FMAB']
colnames(myanimelist_anime)
## [1] "anime_id" "title" "type" "score"
## [5] "scored_by" "status" "episodes" "start_date"
## [9] "end_date" "source" "members" "favorites"
## [13] "episode_duration" "total_duration" "rating" "sfw"
## [17] "approved" "created_at" "updated_at" "start_year"
## [21] "start_season" "real_start_date" "real_end_date" "broadcast_day"
## [25] "broadcast_time" "genres" "themes" "demographics"
## [29] "studios" "producers" "licensors" "synopsis"
## [33] "background" "main_picture" "url" "trailer_url"
## [37] "title_english" "title_japanese" "title_synonyms"
nrow(myanimelist_anime)
## [1] 24985
bestsellingmanga <- read.csv("best-selling-manga.csv")
head(bestsellingmanga, 1)
## Manga.series Author.s. Publisher Demographic No..of.collected.volumes
## 1 One Piece Eiichiro Oda Shueisha Shōnen 104
## Serialized Approximate.sales.in.million.s.
## 1 1997–present 516.6
## Average.sales.per.volume.in.million.s.
## 1 4.97
colnames(bestsellingmanga)
## [1] "Manga.series"
## [2] "Author.s."
## [3] "Publisher"
## [4] "Demographic"
## [5] "No..of.collected.volumes"
## [6] "Serialized"
## [7] "Approximate.sales.in.million.s."
## [8] "Average.sales.per.volume.in.million.s."
nrow(bestsellingmanga)
## [1] 187
## Data Cleaning/Tidying
I first loaded the MyAnimeList datasets into myanimelist_manga and myanimelist_anime and cleaned them up, which included dropping any columns that were unnecessary for my analysis. I initially wanted to join them to get a list of which manga had anime adaptations. When I tried to join them by title, I got a 'many-to-many' relationship error, which was still there even after I cleaned up the duplicate names. Then I decided that all I needed for my analysis was a yes-or-no column: does this manga have an anime adaptation? So I created myanimelist, which is myanimelist_manga plus this anime_adaption column. I ran into some problems with for loops timing out while setting the yes/no values in the column. Finally, I found out that what I wished to achieve can be done with a single expression.
myanimelist_manga <- subset(myanimelist_manga, select=c(1,2,3,4,5,6,7,8,9,10,11,12,13,19,20,21,23))
colnames(myanimelist_manga)
## [1] "manga_id" "title" "type" "score"
## [5] "scored_by" "status" "volumes" "chapters"
## [9] "start_date" "end_date" "members" "favorites"
## [13] "sfw" "genres" "themes" "demographics"
## [17] "serializations"
# Clean columns: end_date (blank -> NA), demographics, serializations
# Blank end dates become NA
myanimelist_manga$end_date[myanimelist_manga$end_date==""] <- NA
# Keep only the first capitalised word of each list-like string
myanimelist_manga$demographics <- str_extract(myanimelist_manga$demographics, "[A-Z]+[a-z]+")
myanimelist_manga$serializations <- str_extract(myanimelist_manga$serializations, "[A-Z]+[a-z]+")
# Normalise a title that is spelled with a hyphen
myanimelist_manga$title[myanimelist_manga$title=="One Punch-Man"] <- "One Punch Man"
#myanimelist_anime clean up - this was not used too much after
myanimelist_anime <- subset(myanimelist_anime, select=c(1,2,3,4,5,6,7,8,9,10,11,12,15,29))
colnames(myanimelist_anime)
## [1] "anime_id" "title" "type" "score" "scored_by"
## [6] "status" "episodes" "start_date" "end_date" "source"
## [11] "members" "favorites" "rating" "studios"
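One side effect of the "[A-Z]+[a-z]+" pattern used in this cleaning step is that it keeps only the first capitalised word of each value, which is why serializations later shows entries such as "Big" or "Young". The sketch below is a base-R illustration of that behaviour next to one possible alternative that keeps the whole first entry; the variable names and the second input string are made up for the example.

```r
# Strings in the same list-like format as the serializations column.
# The first value is taken from the Berserk row above; the second is a made-up example.
x <- c("['Young Animal']", "['Example Weekly', 'Example Monthly']", "[]")

# Pattern from the cleaning code: capital letter(s) followed by lowercase letters
m <- regexpr("[A-Z]+[a-z]+", x)
first_word <- rep(NA_character_, length(x))
first_word[m > 0] <- regmatches(x, m)
first_word   # "Young" "Example" NA

# Alternative: keep everything between the first pair of single quotes
first_entry <- ifelse(grepl("'", x),
                      sub("^\\['([^']*)'.*$", "\\1", x),
                      NA_character_)
first_entry  # "Young Animal" "Example Weekly" NA
```

Either version still keeps just one entry per manga, so a title serialized in several magazines is reduced to the first one listed.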
#clean columns: end_date (fix blanks -> NA), studios
#end date add NA
myanimelist_anime$end_date[myanimelist_anime$end_date==""] <- NA
myanimelist_anime$studios <- str_extract(myanimelist_anime$studios, "[A-Z]+[a-z]+")
# Rename columns that share a name with myanimelist_manga so a join would not clash
colnames(myanimelist_anime)[3] = "type_anime"
colnames(myanimelist_anime)[4] = "score_anime"
colnames(myanimelist_anime)[5] = "scored_by_anime"
colnames(myanimelist_anime)[6] = "status_anime"
colnames(myanimelist_anime)[8] = "start_date_anime"
colnames(myanimelist_anime)[9] = "end_date_anime"
colnames(myanimelist_anime)[11] = "members_anime"
colnames(myanimelist_anime)[12] = "favorites_anime"
myanimelist <- myanimelist_manga
myanimelist$anime_adaption <- 'No'
# Flag every title that also appears in myanimelist_anime (vectorized, replaces the slow for loop)
myanimelist$anime_adaption[myanimelist$title %in% myanimelist_anime$title] <- "Yes"
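To illustrate why the join misbehaved while the %in% test did not, here is a minimal sketch on made-up rows (the toy data frames and their contents are assumptions, not the real data). A join returns one row per matching pair, so repeated titles multiply rows, whereas %in% only asks whether a title appears at all.

```r
# Toy tables: the anime table lists one title twice (e.g. a series and a movie)
manga <- data.frame(title = c("Berserk", "Toxic", "Naruto"),
                    stringsAsFactors = FALSE)
anime <- data.frame(title = c("Naruto", "Naruto", "Berserk"),
                    type_anime = c("tv", "movie", "tv"),
                    stringsAsFactors = FALSE)

# A join keeps every matching pair, so the repeated title appears twice
joined <- merge(manga, anime, by = "title")
nrow(joined)  # 3 rows for only 2 matched manga

# A membership test gives exactly one answer per manga row
manga$anime_adaption <- ifelse(manga$title %in% anime$title, "Yes", "No")
manga
```

Note that %in% matches titles exactly, so an adaptation listed under a different spelling would still be flagged "No".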
For the second dataset, bestsellingmanga, I inner joined it with myanimelist_manga after renaming the Manga.series column to title so that it matches myanimelist_manga. This gave me bestsellingmanga with all the extra information from myanimelist_manga attached.
colnames(bestsellingmanga)[1]="title"
bestsellingmanga <- inner_join(bestsellingmanga, myanimelist_manga, by = "title")
# Keep only rows whose type is "manga"; other types with the same title created duplicates
bestsellingmanga <- bestsellingmanga[bestsellingmanga$type == 'manga', ]
head(bestsellingmanga)
## title Author.s. Publisher Demographic
## 1 One Piece Eiichiro Oda Shueisha Shōnen
## 2 Golgo 13 Takao Saito, Saito Production Shogakukan Seinen
## 3 Dragon Ball Akira Toriyama Shueisha Shōnen
## 4 Doraemon Fujiko F. Fujio Shogakukan Children
## 6 Naruto Masashi Kishimoto Shueisha Shōnen
## 8 Slam Dunk Takehiko Inoue Shueisha Shōnen
## No..of.collected.volumes Serialized Approximate.sales.in.million.s.
## 1 104 1997–present 516.6
## 2 207 1968–present 300.0
## 3 42 1984–1995 260.0
## 4 45 1969–1996 250.0
## 6 72 1999–2014 250.0
## 8 31 1990–1996 170.0
## Average.sales.per.volume.in.million.s. manga_id type score scored_by
## 1 4.97 13 manga 9.22 355375
## 2 1.45 1298 manga 7.85 854
## 3 6.19 42 manga 8.41 92616
## 4 4.71 1032 manga 8.44 6919
## 6 3.47 11 manga 8.07 264788
## 8 5.48 51 manga 9.08 70877
## status volumes chapters start_date end_date members favorites
## 1 currently_publishing NA NA 1997-07-22 <NA> 579557 111462
## 2 currently_publishing NA NA 1968-11-29 <NA> 5715 86
## 3 finished 42 520 1984-11-20 1995-05-23 151685 13965
## 4 finished 45 821 1969-12-01 1996-01-01 13837 873
## 6 finished 72 700 1999-09-21 2014-11-10 402677 43311
## 8 finished 31 276 1990-09-18 1996-06-04 157962 14970
## sfw genres
## 1 True ['Action', 'Adventure', 'Fantasy']
## 2 True ['Action', 'Adventure', 'Award Winning', 'Drama', 'Mystery']
## 3 True ['Action', 'Adventure', 'Comedy', 'Sci-Fi']
## 4 True ['Adventure', 'Award Winning', 'Comedy', 'Sci-Fi', 'Slice of Life']
## 6 True ['Action', 'Adventure', 'Fantasy']
## 8 True ['Award Winning', 'Sports']
## themes demographics serializations
## 1 [] Shounen Shounen
## 2 ['Adult Cast', 'Historical'] Seinen Big
## 3 ['Martial Arts', 'Super Power'] Shounen Shounen
## 4 ['Anthropomorphic', 'School'] Kids <NA>
## 6 ['Martial Arts'] Shounen Shounen
## 8 ['School', 'Team Sports'] Shounen Shounen
nrow(bestsellingmanga)
## [1] 99
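The joined table has 99 rows while the original best-seller list had 187, so a fair number of series presumably found no exact title match. A quick way to see which ones were dropped is to compare the two title columns directly. The sketch below uses tiny made-up tables (best and mal are hypothetical names) and reuses the One Punch Man spelling difference from the cleaning step.

```r
# Toy versions of the two title columns
best <- data.frame(title = c("One Piece", "One Punch Man", "Golgo 13"),
                   stringsAsFactors = FALSE)
mal  <- data.frame(title = c("One Piece", "One Punch-Man", "Golgo 13"),
                   stringsAsFactors = FALSE)

# Titles in the best-seller list that have no exact match in the other table
unmatched <- setdiff(best$title, mal$title)
unmatched  # "One Punch Man"
```

Running the same comparison on the real tables before the join would list the titles that need renaming, or that are missing from MyAnimeList altogether.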
bestsellingmanga$anime_adaption <- 'No'
bestsellingmanga$anime_adaption[bestsellingmanga$title %in% myanimelist_anime$title] <- "Yes"
head(bestsellingmanga)
## title Author.s. Publisher Demographic
## 1 One Piece Eiichiro Oda Shueisha Shōnen
## 2 Golgo 13 Takao Saito, Saito Production Shogakukan Seinen
## 3 Dragon Ball Akira Toriyama Shueisha Shōnen
## 4 Doraemon Fujiko F. Fujio Shogakukan Children
## 6 Naruto Masashi Kishimoto Shueisha Shōnen
## 8 Slam Dunk Takehiko Inoue Shueisha Shōnen
## No..of.collected.volumes Serialized Approximate.sales.in.million.s.
## 1 104 1997–present 516.6
## 2 207 1968–present 300.0
## 3 42 1984–1995 260.0
## 4 45 1969–1996 250.0
## 6 72 1999–2014 250.0
## 8 31 1990–1996 170.0
## Average.sales.per.volume.in.million.s. manga_id type score scored_by
## 1 4.97 13 manga 9.22 355375
## 2 1.45 1298 manga 7.85 854
## 3 6.19 42 manga 8.41 92616
## 4 4.71 1032 manga 8.44 6919
## 6 3.47 11 manga 8.07 264788
## 8 5.48 51 manga 9.08 70877
## status volumes chapters start_date end_date members favorites
## 1 currently_publishing NA NA 1997-07-22 <NA> 579557 111462
## 2 currently_publishing NA NA 1968-11-29 <NA> 5715 86
## 3 finished 42 520 1984-11-20 1995-05-23 151685 13965
## 4 finished 45 821 1969-12-01 1996-01-01 13837 873
## 6 finished 72 700 1999-09-21 2014-11-10 402677 43311
## 8 finished 31 276 1990-09-18 1996-06-04 157962 14970
## sfw genres
## 1 True ['Action', 'Adventure', 'Fantasy']
## 2 True ['Action', 'Adventure', 'Award Winning', 'Drama', 'Mystery']
## 3 True ['Action', 'Adventure', 'Comedy', 'Sci-Fi']
## 4 True ['Adventure', 'Award Winning', 'Comedy', 'Sci-Fi', 'Slice of Life']
## 6 True ['Action', 'Adventure', 'Fantasy']
## 8 True ['Award Winning', 'Sports']
## themes demographics serializations anime_adaption
## 1 [] Shounen Shounen Yes
## 2 ['Adult Cast', 'Historical'] Seinen Big Yes
## 3 ['Martial Arts', 'Super Power'] Shounen Shounen Yes
## 4 ['Anthropomorphic', 'School'] Kids <NA> Yes
## 6 ['Martial Arts'] Shounen Shounen Yes
## 8 ['School', 'Team Sports'] Shounen Shounen Yes
There were two columns in myanimelist_manga where the data were stored as strings that look like lists: genres and themes. In order to explore these two, I decided to turn them into long datasets (one row per manga and genre, or per manga and theme). Because each list sits inside a single string rather than being spread across several columns, I was not going from wide to long as in our previous classwork, so I could not use pivot_longer. Instead, I wrote a for loop over the rows that splits the genres/themes string into a proper vector of strings. A nested for loop then adds that manga's title and genre/theme (one at a time) to a new dataset: manga_genres or manga_themes.
manga_genres <- data.frame(matrix(ncol=2,nrow=0))
colnames(manga_genres) <- c('title','genre')
manga_genres
## [1] title genre
## <0 rows> (or 0-length row.names)
for (i in 1:nrow(bestsellingmanga)){
  genre_manga <- str_extract_all(bestsellingmanga$genres[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in genre_manga){
    for(n in m){
      manga_genres[nrow(manga_genres) + 1, ] = c(bestsellingmanga$title[i], n)
    }
  }
}
head(manga_genres)
## title genre
## 1 One Piece Action
## 2 One Piece Adventure
## 3 One Piece Fantasy
## 4 Golgo 13 Action
## 5 Golgo 13 Adventure
## 6 Golgo 13 Award Winning
nrow(manga_genres)
## [1] 295
manga_themes <- data.frame(matrix(ncol=2,nrow=0))
colnames(manga_themes) <- c('title','theme')
manga_themes
## [1] title theme
## <0 rows> (or 0-length row.names)
for (i in 1:nrow(bestsellingmanga)){
  theme_manga <- str_extract_all(bestsellingmanga$themes[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in theme_manga){
    for(n in m){
      manga_themes[nrow(manga_themes) + 1, ] = c(bestsellingmanga$title[i], n)
    }
  }
}
head(manga_themes)
## title theme
## 1 Golgo 13 Adult Cast
## 2 Golgo 13 Historical
## 3 Dragon Ball Martial Arts
## 4 Dragon Ball Super Power
## 5 Doraemon Anthropomorphic
## 6 Doraemon School
nrow(manga_themes)
## [1] 126
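The long tables above can also be built without loops. The following is a minimal base-R sketch of that idea, assuming every value has the form ['A', 'B'] and that no genre or theme contains a comma or an apostrophe. The toy data frame copies two genre strings from the head() output above and adds one made-up row with an empty list.

```r
toy <- data.frame(
  title  = c("One Piece", "Slam Dunk", "No Tags"),   # "No Tags" is a made-up row
  genres = c("['Action', 'Adventure', 'Fantasy']",
             "['Award Winning', 'Sports']",
             "[]"),
  stringsAsFactors = FALSE
)

# Remove the brackets and quotes, then split each string on ", "
cleaned  <- gsub("\\[|\\]|'", "", toy$genres)
tag_list <- strsplit(cleaned, ", ", fixed = TRUE)

# Repeat each title once per tag; rows with an empty list contribute nothing
long <- data.frame(
  title = rep(toy$title, lengths(tag_list)),
  genre = unlist(tag_list),
  stringsAsFactors = FALSE
)
long
```

Building the result in one step avoids growing a data frame one row at a time, which is the part of the loop that becomes slow on large inputs.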
Below, I plotted some bar graphs based on the datasets I cleaned up and created.
animelist <- data.frame(matrix(ncol=2,nrow=0))
colnames(animelist) <- c('anime_adaption','count')
animelist[nrow(animelist) + 1, ] = c('Yes', nrow(bestsellingmanga[bestsellingmanga$anime_adaption == "Yes", ]))
animelist[nrow(animelist) + 1, ] = c('No', nrow(bestsellingmanga[bestsellingmanga$anime_adaption == "No", ]))
ggplot(data=animelist, aes(x=anime_adaption, y=count)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
The first one shows how many of the manga in the bestsellingmanga dataset have an anime adaptation. An adaptation could help get more eyes on a manga, so I thought it might be a good variable to look at. As shown, 82 of the 99 best-selling manga had some anime adaptation. With such a high percentage of them having one, I think it is a clear indicator of success.
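As a small worked check of the 82-out-of-99 figure, the counts and the share can be computed in two lines with table() and prop.table(). The vector below is rebuilt from the reported counts rather than read from the data, so it is only a stand-in for bestsellingmanga$anime_adaption.

```r
# Stand-in for the anime_adaption column: 82 "Yes" and 17 "No", as reported above
adaption <- rep(c("Yes", "No"), times = c(82, 17))

counts <- table(adaption)        # one count per category
share  <- prop.table(counts)     # proportions that sum to 1

counts
round(100 * share["Yes"], 2)     # 82.83
```

To judge whether this share is unusually high, the same calculation could be run on the full myanimelist table for comparison; that baseline is not computed here.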
manga_genres$count <- 1
top_genres <-
manga_genres %>% group_by(genre) %>%
summarise(count=sum(count),
.groups = 'drop')
top_genres <- top_genres[order(top_genres$count, decreasing = TRUE), ]
ggplot(data=top_genres, aes(x=genre, y=count)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
manga_themes$count <- 1
top_themes <-
manga_themes %>% group_by(theme) %>%
summarise(count=sum(count),
.groups = 'drop')
top_themes <- top_themes[order(top_themes$count, decreasing = TRUE), ]
ggplot(data=top_themes, aes(x=theme, y=count)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
The two graphs above are bar graphs that show how many times
different genres and themes appeared in the bestsellingmanga dataset.
There are clear topics that pop up a lot.
For genres, the top 5 are: Award Winning, Comedy, Action, Drama,
Romance.
For themes, the top 5 are: School, Team Sports, Historical, Delinquents,
Psychological.
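One detail worth knowing when reading the top five off a bar chart: ggplot2 places a character x variable in alphabetical order, so sorting the rows of a summary table does not change the order of the bars. A common workaround is to store the category as a factor whose levels follow the counts. The sketch below shows the idea in base R on made-up tags (the vector and the top_genres_toy name are assumptions for illustration).

```r
# Made-up tags in the shape of the genre column of a long table
genres <- c("Comedy", "Action", "Comedy", "Drama", "Comedy", "Action")

# Count each tag and sort from most to least frequent
counts <- sort(table(genres), decreasing = TRUE)
top2   <- names(counts)[1:2]
top2   # "Comedy" "Action"

# Factor levels in count order; a plot axis would then follow this order
top_genres_toy <- data.frame(
  genre = factor(names(counts), levels = names(counts)),
  count = as.vector(counts)
)
levels(top_genres_toy$genre)
```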
For my last analysis, I took samples of the big myanimelist dataset and computed how often the randomly chosen manga shared genres and themes with the top five from bestsellingmanga. The top genres and themes of bestsellingmanga can be an indicator of success, as many of the best-sellers seem to have some in common.
Below are my code and results for three sample sizes: 10, 50 and 100. To calculate this, I first drew a sample of the size I wanted. Then, I created two long datasets for the sample's genres and themes like I did for bestsellingmanga. I used these to create two more datasets that tally which genres and themes showed up in the sample and how many times. Then, to find the probability that the sample contained the same genres and themes as bestsellingmanga, I looped through the sample tally dataset and kept a counter of how many times the top 5 genres/themes (from bestsellingmanga) appeared. Afterwards, I divided the counter by the total number of genre/theme tags in the sample.
set.seed(49568)
samp <- myanimelist %>%
sample_n(10)
samp
## manga_id title type score scored_by status volumes
## 1 61503 Hanazono no Kioku manga 7.41 1094 finished 1
## 2 56513 Fudatsuki no Kyouko-chan manga 7.05 6942 finished 7
## 3 1820 Hoshigari Love Dollar manga 6.91 837 finished 3
## 4 103209 Ajuutan manga NA 13 finished 8
## 5 108492 Busu ni Hanataba wo. manga 7.24 626 finished 12
## 6 145530 Tonda Couple manga NA 4 finished 15
## 7 8286 Mizugokoro one_shot NA 89 finished NA
## 8 157476 A Handsome Swordsman manhwa NA 19 finished NA
## 9 22455 Toxic manga 6.90 636 finished 3
## 10 81843 Money♥Honey one_shot NA 71 finished NA
## chapters start_date end_date members favorites sfw
## 1 7 2013-03-07 2013-07-05 2121 8 False
## 2 37 2013-08-12 2016-06-11 16100 69 True
## 3 9 2003-01-01 <NA> 1751 1 True
## 4 81 2016-12-11 2020-01-26 197 0 True
## 5 74 2016-04-04 2022-09-02 2672 20 True
## 6 NA 1978-02-22 1981-02-25 38 0 True
## 7 1 2002-06-20 2002-06-20 170 0 False
## 8 100 2020-12-09 2022-06-28 91 0 True
## 9 15 2010-06-14 2011-01-01 1821 11 True
## 10 1 2007-04-11 2007-04-11 173 0 False
## genres themes
## 1 ['Boys Love', 'Drama', 'Erotica'] []
## 2 ['Comedy', 'Romance', 'Supernatural'] ['School', 'Vampire']
## 3 [] []
## 4 ['Action', 'Drama', 'Fantasy', 'Horror'] []
## 5 ['Comedy', 'Romance'] ['School']
## 6 ['Award Winning', 'Romance'] ['Love Polygon', 'School']
## 7 ['Boys Love', 'Erotica'] []
## 8 ['Action', 'Fantasy'] ['Martial Arts']
## 9 ['Action'] ['Military']
## 10 ['Hentai'] []
## demographics serializations anime_adaption
## 1 <NA> Magazine No
## 2 Shounen Gessan No
## 3 Shoujo Sho No
## 4 <NA> Ura No
## 5 Seinen Young Yes
## 6 Shounen Shounen No
## 7 <NA> <NA> No
## 8 <NA> Kakao No
## 9 Shoujo Comic No
## 10 <NA> Comic No
samp_genres <- data.frame(matrix(ncol=2,nrow=0))
colnames(samp_genres) <- c('title','genre')
for (i in 1:nrow(samp)){
  t <- str_extract_all(samp$genres[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in t){
    for(n in m){
      samp_genres[nrow(samp_genres) + 1, ] = c(samp$title[i], n)
    }
  }
}
samp_genres <- na.omit(samp_genres)
samp_genres$count <- 1
samp_genres_stats10 <- samp_genres %>% group_by(genre) %>%
summarise(count=sum(count),
.groups = 'drop')
samp_genres_stats10 <- samp_genres_stats10[order(samp_genres_stats10$count, decreasing = TRUE), ]
samp_genres_stats10
## # A tibble: 11 × 2
## genre count
## <chr> <dbl>
## 1 Action 3
## 2 Romance 3
## 3 Boys Love 2
## 4 Comedy 2
## 5 Drama 2
## 6 Erotica 2
## 7 Fantasy 2
## 8 Award Winning 1
## 9 Hentai 1
## 10 Horror 1
## 11 Supernatural 1
t <- 0
for(x in 1:nrow(samp_genres_stats10)){
  if(samp_genres_stats10$genre[x] %in% head(top_genres$genre, 5)){
    t <- t + samp_genres_stats10$count[x]
  }
}
samp_themes <- data.frame(matrix(ncol=2,nrow=0))
colnames(samp_themes) <- c('title','theme')
for (i in 1:nrow(samp)){
  t <- str_extract_all(samp$themes[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in t){
    for(n in m){
      samp_themes[nrow(samp_themes) + 1, ] = c(samp$title[i], n)
    }
  }
}
samp_themes <- na.omit(samp_themes)
samp_themes$count <- 1
samp_themes_stats10 <- samp_themes %>% group_by(theme) %>%
summarise(count=sum(count),
.groups = 'drop')
samp_themes_stats10 <- samp_themes_stats10[order(samp_themes_stats10$count, decreasing = TRUE), ]
samp_themes_stats10
## # A tibble: 5 × 2
## theme count
## <chr> <dbl>
## 1 School 3
## 2 Love Polygon 1
## 3 Martial Arts 1
## 4 Military 1
## 5 Vampire 1
t <- 0
for(x in 1:nrow(samp_themes_stats10)){
  if(samp_themes_stats10$theme[x] %in% head(top_themes$theme, 5)){
    t <- t + samp_themes_stats10$count[x]
  }
}
Above is the code for sample size 10.
The results show that 15% of the sample's genre tags match the top five genres of bestsellingmanga.
The results show that 42.86% of the sample's theme tags match the top five themes of bestsellingmanga.
set.seed(493954)
samp <- myanimelist %>%
sample_n(50)
samp_genres <- data.frame(matrix(ncol=2,nrow=0))
colnames(samp_genres) <- c('title','genre')
for (i in 1:nrow(samp)){
  t <- str_extract_all(samp$genres[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in t){
    for(n in m){
      samp_genres[nrow(samp_genres) + 1, ] = c(samp$title[i], n)
    }
  }
}
samp_genres <- na.omit(samp_genres)
samp_genres$count <- 1
samp_genres_stats50 <- samp_genres %>% group_by(genre) %>%
summarise(count=sum(count),
.groups = 'drop')
samp_genres_stats50 <- samp_genres_stats50[order(samp_genres_stats50$count, decreasing = TRUE), ]
t <- 0
for(x in 1:nrow(samp_genres_stats50)){
  if(samp_genres_stats50$genre[x] %in% head(top_genres$genre, 5)){
    t <- t + samp_genres_stats50$count[x]
  }
}
samp_themes <- data.frame(matrix(ncol=2,nrow=0))
colnames(samp_themes) <- c('title','theme')
for (i in 1:nrow(samp)){
  t <- str_extract_all(samp$themes[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in t){
    for(n in m){
      samp_themes[nrow(samp_themes) + 1, ] = c(samp$title[i], n)
    }
  }
}
samp_themes <- na.omit(samp_themes)
samp_themes$count <- 1
samp_themes_stats50 <- samp_themes %>% group_by(theme) %>%
summarise(count=sum(count),
.groups = 'drop')
samp_themes_stats50 <- samp_themes_stats50[order(samp_themes_stats50$count, decreasing = TRUE), ]
t <- 0
for(x in 1:nrow(samp_themes_stats50)){
  if(samp_themes_stats50$theme[x] %in% head(top_themes$theme, 5)){
    t <- t + samp_themes_stats50$count[x]
  }
}
Above is the code for sample size 50.
The results show that 15% of the sample's genre tags match the top five genres of bestsellingmanga.
The results show that 68.18% of the sample's theme tags match the top five themes of bestsellingmanga.
set.seed(493024)
samp <- myanimelist %>%
sample_n(100)
samp_genres <- data.frame(matrix(ncol=2,nrow=0))
colnames(samp_genres) <- c('title','genre')
for (i in 1:nrow(samp)){
  t <- str_extract_all(samp$genres[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in t){
    for(n in m){
      samp_genres[nrow(samp_genres) + 1, ] = c(samp$title[i], n)
    }
  }
}
samp_genres <- na.omit(samp_genres)
samp_genres$count <- 1
samp_genres_stats100 <- samp_genres %>% group_by(genre) %>%
summarise(count=sum(count),
.groups = 'drop')
samp_genres_stats100 <- samp_genres_stats100[order(samp_genres_stats100$count, decreasing = TRUE), ]
t <- 0
for(x in 1:nrow(samp_genres_stats100)){
  if(samp_genres_stats100$genre[x] %in% head(top_genres$genre, 5)){
    t <- t + samp_genres_stats100$count[x]
  }
}
samp_themes <- data.frame(matrix(ncol=2,nrow=0))
colnames(samp_themes) <- c('title','theme')
for (i in 1:nrow(samp)){
  t <- str_extract_all(samp$themes[i],"[A-Za-z]+(-[A-Za-z]+)?( [A-Za-z]+ [A-Za-z]+)?( [A-Za-z]+)?")
  for(m in t){
    for(n in m){
      samp_themes[nrow(samp_themes) + 1, ] = c(samp$title[i], n)
    }
  }
}
samp_themes <- na.omit(samp_themes)
samp_themes$count <- 1
samp_themes_stats100 <- samp_themes %>% group_by(theme) %>%
summarise(count=sum(count),
.groups = 'drop')
samp_themes_stats100 <- samp_themes_stats100[order(samp_themes_stats100$count, decreasing = TRUE), ]
t <- 0
for(x in 1:nrow(samp_themes_stats100)){
  if(samp_themes_stats100$theme[x] %in% head(top_themes$theme, 5)){
    t <- t + samp_themes_stats100$count[x]
  }
}
Above is the code for sample size 100.
The results show that 13.66% of the sample's genre tags match the top five genres of bestsellingmanga.
The results show that 47.17% of the sample's theme tags match the top five themes of bestsellingmanga.
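The three blocks above repeat the same steps for each sample size, so the calculation could be wrapped in one small function. The sketch below is one possible version, not the code that produced the reported numbers: top_share is a hypothetical helper, and it assumes values of the form ['A', 'B']. As a check, it is applied to the theme strings of the size-10 sample printed earlier, with the top five themes listed above.

```r
# Share of all tags in a column of list-like strings that belong to `top`
top_share <- function(list_strings, top) {
  cleaned <- gsub("\\[|\\]|'", "", list_strings)        # drop brackets and quotes
  tags <- unlist(strsplit(cleaned, ", ", fixed = TRUE))  # one element per tag
  if (length(tags) == 0) return(NA_real_)                # no tags at all
  mean(tags %in% top)                                    # matching tags / all tags
}

# Theme strings of the ten sampled manga, copied from the printed sample
samp_themes_raw <- c("[]", "['School', 'Vampire']", "[]", "[]", "['School']",
                     "['Love Polygon', 'School']", "[]", "['Martial Arts']",
                     "['Military']", "[]")
top_themes5 <- c("School", "Team Sports", "Historical", "Delinquents", "Psychological")

share <- top_share(samp_themes_raw, top_themes5)
round(100 * share, 2)  # 42.86
```

With the real objects, the call would look like top_share(samp$themes, head(top_themes$theme, 5)), repeated for each sample and for genres.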
With the results from the three sample sizes, we can see no steady trend as the sample size grows: the genre match stayed at 15% for sizes 10 and 50 and dipped to 13.66% at size 100, while the theme match rose from 42.86% to 68.18% and then fell to 47.17%. None of these samples reach past 70%, meaning there are a lot of genres/themes that a manga can fit into, and the odds that a given manga falls under one of the most popular ones are low.
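Each sample size above was drawn only once, so part of the difference between the three percentages may simply be sampling noise. A rough way to see how much a single sample can vary is to repeat the draw many times on a made-up population. In the sketch below, the population, the 15% rate and the share_spread helper are all assumptions chosen for illustration, and it samples individual tags rather than whole manga to keep things short.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical population of tags: 1 = in the best-seller top five, 0 = not
population <- rbinom(10000, size = 1, prob = 0.15)

# Draw many samples of size n and summarise the share of top-five tags
share_spread <- function(n, reps = 1000) {
  shares <- replicate(reps, mean(sample(population, n)))
  c(mean = mean(shares), sd = sd(shares))
}

# One column per sample size, rows "mean" and "sd"
spread <- sapply(c(n10 = 10, n50 = 50, n100 = 100), share_spread)
round(spread, 3)
```

The spread of the share shrinks as the sample size grows, which suggests that a single sample of 10 says much less about the full dataset than a single sample of 100.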
The three variables I decided to look at (whether there is an anime adaptation, its genres, and its themes) all appear to have an effect on the success of a new manga, to varying degrees.
Anime adaptations were the strongest factor: the split among the best-selling manga was 17:82 (without/with an adaptation), meaning about 82.8% of the best-selling manga had an anime. This is probably because people who watch a lot of anime see the new show or movie and, if they like it, check out the manga behind it.
Genres and themes had clear topics that a lot of the best-selling manga fall under. This cannot be a coincidence. There are certain genres that the public or the manga-reading community gravitates towards more. So if your new manga happens to fall under those topics, it could have more eyes/readers looking at it because it is about something popular.
For my one feature that we did not talk about in class, I created my presentation as an RMarkdown presentation. I did not know that was something you could do until I saw it as an example. It was super easy to use, especially since I could just transfer my code directly from my RMarkdown document. My big takeaway/next step would be to figure out how to change the font size or crop words/code, because some of my slides had lines that ran off the page.
For my clean-up and analysis, I used a lot of for loops that slowed my computer down, especially in my failed attempt at incorporating myanimelist_anime, where I tried to loop twice through the 64,833 rows of myanimelist_manga. My takeaway from this is to learn how to do the calculations I did without for loops. For example, I realized afterwards that for the sample percentage calculations I could have used sum(str_detect(samp$genres, g)) for each genre (or theme) g in head(top_genres$genre, 5) and added the results up.
Next steps in general for this project would be diving deeper into testing samples using probability and testing more variables that might also be big success factors (e.g. start_date, anime_length).