Introduction
Dataset explanation
WEBTOON is a digital comic publisher launched by Naver Corporation in South Korea in 2004. The platform launched first in Korea as Naver Webtoon and then globally as LINE Webtoon in July 2014, and it can be accessed through its website and mobile app. Webtoon offers hundreds of webcomics in numerous genres.
This project takes a deep dive into a Webtoon dataset obtained from Kaggle. The dataset contains information on more than 500 Webtoon Originals as of September 23, 2021. Let’s see what interesting information we can find in it.
Questions
There are some things I am curious about from this dataset:
- Which comic has the highest number of likes and subscribers?
- Is there any writer who has published more than one comic?
- How many comic genres have been published, and which genres are published the most?
- Which genre is most favored by readers (in terms of number of likes and subscribers)?
- Which comics are updated most often?
- On which days do writers update their comics the most?
Load libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(rmarkdown)
Data Preparation
Read and Inspect Dataset
webtoon <- read_csv("Webtoon.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## id = col_double(),
## Name = col_character(),
## Writer = col_character(),
## Likes = col_character(),
## Genre = col_character(),
## Rating = col_double(),
## Subscribers = col_character(),
## Summary = col_character(),
## Update = col_character(),
## `Reading Link` = col_character()
## )
str(webtoon)
## spec_tbl_df[,10] [1,114 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:1114] 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr [1:1114] "Let's Play" "True Beauty" "Midnight Poppy Land" "Age Matters" ...
## $ Writer : chr [1:1114] "Leeanne M. Krecic (Mongie)" "Yaongyi" "Lilydusk" "Enjelicious" ...
## $ Likes : chr [1:1114] "30.6M" "39.9M" "10.4M" "25.9M" ...
## $ Genre : chr [1:1114] "Romance" "Romance" "Romance" "Romance" ...
## $ Rating : num [1:1114] 9.62 9.6 9.81 9.79 9.85 9.82 9.66 9.87 9.82 9.78 ...
## $ Subscribers : chr [1:1114] "4.2M" "6.4M" "2.1M" "3.5M" ...
## $ Summary : chr [1:1114] "She's young, single and about to achieve her dream of creating incredible videogames. But then life throws her "| __truncated__ "After binge-watching beauty videos online, a shy comic book fan masters the art of makeup and sees her social s"| __truncated__ "After making a grisly discovery in the countryside, a small town book editor's life gets entangled with a young"| __truncated__ "She's a hopeless romantic who's turning 30's and is not super happy about it. He's a reclusive billionaire who"| __truncated__ ...
## $ Update : chr [1:1114] "UP EVERY TUESDAY" "UP EVERY WEDNESDAY" "UP EVERY SATURDAY" "UP EVERY WEDNESDAY" ...
## $ Reading Link: chr [1:1114] "https://www.webtoons.com/en/romance/letsplay/list?title_no=1218" "https://www.webtoons.com/en/romance/truebeauty/list?title_no=1436" "https://www.webtoons.com/en/romance/midnight-poppy-land/list?title_no=1798" "https://www.webtoons.com/en/romance/age-matters/list?title_no=1364" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. Name = col_character(),
## .. Writer = col_character(),
## .. Likes = col_character(),
## .. Genre = col_character(),
## .. Rating = col_double(),
## .. Subscribers = col_character(),
## .. Summary = col_character(),
## .. Update = col_character(),
## .. `Reading Link` = col_character()
## .. )
rmarkdown::paged_table(webtoon)
There are 10 variables in this dataset. Here is the explanation of the variables:
- id - Unique id identifying the comic.
- Name - Full name of the comic.
- Writer - Author of the comic.
- Likes - Total no. of Likes.
- Genre - Genres for comic.
- Rating - An average rating out of 10 for the comic.
- Subscribers - Total no. of subscribers.
- Summary - Summary of the comic.
- Update - Day of the week for an update.
- Reading Link - Link where you can read the comic.
Looks like some variables do not have the correct data type and need to be converted into the correct ones.
Data Wrangling
Missing Values
anyNA(webtoon)
## [1] TRUE
colSums(is.na(webtoon))
## id Name Writer Likes Genre Rating
## 0 0 2 0 0 0
## Subscribers Summary Update Reading Link
## 0 0 0 0
Looks like there are 2 missing values in the Writer column.
webtoon %>% filter(is.na(Writer))
## # A tibble: 2 x 10
## id Name Writer Likes Genre Rating Subscribers Summary Update
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 254 My Room~ <NA> 1M Roman~ 9.62 257.3K Under normal ci~ UP EVE~
## 2 748 The Nun~ <NA> 1M Drama 9.57 204.9K When Manager So~ UP EVE~
## # ... with 1 more variable: Reading Link <chr>
The writers for My Roommate is a Gumiho and The Nuna at Our Office are missing. After checking the original website, I found that the writer of My Roommate is a Gumiho is NA and the writer of The Nuna at Our Office is WasakBasak / JANE.
webtoon[webtoon$id == 254, ]$Writer <- "NA"
webtoon[webtoon$id == 748, ]$Writer <- "WasakBasak / JANE"
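A writer whose name is literally NA is easy to lose at import time, because R's CSV readers treat the string NA as a missing value by default. That is presumably what happened to the first of the two comics. The sketch below is only an illustration on made-up inline data (the two-row CSV is an assumption, not the real file) and uses base R's read.csv(); read_csv() has an analogous na argument.

```r
# Hypothetical two-row CSV standing in for the real file
csv_text <- "id,Writer\n254,NA\n2,Yaongyi\n"

# Default behaviour: the literal string "NA" is parsed as a missing value
default_parse <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat only empty fields as missing, so "NA" survives as a name
literal_parse <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

is.na(default_parse$Writer)
is.na(literal_parse$Writer)
```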
Let’s check whether the missing values have been successfully replaced.
anyNA(webtoon)
## [1] FALSE
That’s great!
Duplicated Values
Before starting the analysis, let’s see if there are duplicated values.
length(unique(webtoon$Name))
## [1] 569
There are 1,114 rows in this dataset but only 569 distinct comic titles, which means some titles are duplicated. To make sure, let’s see which titles are duplicated.
webtoon[duplicated(webtoon$Name) | duplicated(webtoon$Name, fromLast = TRUE), ] %>% arrange(Name) %>% paged_table()
Let’s remove the duplicated values.
webtoon <- webtoon %>%
  distinct(Name, .keep_all = TRUE)
length(unique(webtoon$Name)) == nrow(webtoon)
## [1] TRUE
Now, we’re good to go.
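Note that distinct() with .keep_all = TRUE keeps the first row it encounters for each title and drops the rest, so it is worth confirming that the duplicates really are identical copies. A minimal sketch on a made-up three-row tibble (the values are hypothetical, not taken from the dataset):

```r
library(dplyr)

# Hypothetical data: "True Beauty" appears twice with different like counts
comics <- tibble(
  Name  = c("True Beauty", "Let's Play", "True Beauty"),
  Likes = c("39.9M", "30.6M", "39.8M")
)

# Only the first occurrence of each Name is retained
deduped <- comics %>% distinct(Name, .keep_all = TRUE)

# Titles whose duplicate rows disagree on some other column
conflicting <- comics %>%
  distinct() %>%      # drop rows that are exact copies
  count(Name) %>%
  filter(n > 1)
```

If conflicting comes back empty on the real data, dropping all but the first row per title loses no information.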
Data Manipulation
Number of subscribers
In this dataset, the subscribers variable type is character, when it’s supposed to be numeric.
head(webtoon$Subscribers, 10)
## [1] "4.2M" "6.4M" "2.1M" "3.5M" "1.5M" "3M" "649K" "537.6K"
## [9] "1.1M" "4.3M"
The total number of subscribers ranges from thousands to millions. The Webtoon website from which this data was scraped does not report subscriber counts to the smallest unit, so in this analysis the counts are treated as rounded down to the nearest hundred thousand for comics with millions of subscribers and to the nearest hundred for comics with thousands of subscribers (e.g. 537.6K).
webtoon$Multiplier <- ifelse(str_detect(webtoon$Subscribers, "M"), 1000000, 1000)
webtoon$Subscribers <- str_replace(webtoon$Subscribers, "M|K|k", "") %>% as.numeric()
webtoon <- webtoon %>% mutate(Subscribers = Subscribers * Multiplier) %>% select(-Multiplier)
head(webtoon$Subscribers, 10)
## [1] 4200000 6400000 2100000 3500000 1500000 3000000 649000 537600 1100000
## [10] 4300000
Likes
As with the Subscribers variable, the Likes variable does not have the correct data type.
head(webtoon$Likes, 20)
## [1] "30.6M" "39.9M" "10.4M" "25.9M" "9.9M" "18.9M"
## [7] "2.9M" "9,29,796" "5.8M" "29M" "9.3M" "10.2M"
## [13] "1.5M" "8.3M" "2,17,959" "10.8M" "7,90,313" "9.9M"
## [19] "1.7M" "3.8M"
Like counts above 1 million are also rounded down to the nearest hundred thousand. Meanwhile, the commas are misplaced in the counts for comics with fewer than 1 million likes (for example, 9,29,796 rather than 929,796). Let’s fix these issues.
webtoon$Multiplier <- ifelse(str_detect(webtoon$Likes, ","), 1, 1000000)
webtoon$Likes <- str_replace_all(webtoon$Likes, "M|,", "") %>% as.numeric()
webtoon <- webtoon %>% mutate(Likes = Likes * Multiplier) %>% select(-Multiplier)
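Because the Subscribers and Likes conversions follow the same pattern, they could be folded into one helper. The sketch below is one possible way to do it in base R. parse_count() is a hypothetical name, not part of any package, and it assumes that K and M are the only suffixes that occur.

```r
# Hypothetical helper: convert strings such as "4.2M", "537.6K" or "9,29,796" to numbers
parse_count <- function(x) {
  x <- toupper(gsub(",", "", trimws(x)))           # drop commas, normalise case
  multiplier <- ifelse(grepl("M$", x), 1e6,
                ifelse(grepl("K$", x), 1e3, 1))    # no suffix means a plain count
  as.numeric(sub("[MK]$", "", x)) * multiplier
}

parse_count(c("4.2M", "537.6K", "9,29,796", "3M"))
```

Unlike the Likes multiplier above, this version would also handle a K suffix in Likes, should one ever appear.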
Genre
Genre should be converted into a factor.
webtoon$Genre <- factor(webtoon$Genre)
unique(webtoon$Genre)
## [1] Romance Supernatural Fantasy Action Drama
## [6] Thriller Mystery Historical Comedy Sci-fi
## [11] Slice of life Heartwarming Superhero Sports Informative
## [16] Horror
## 16 Levels: Action Comedy Drama Fantasy Heartwarming Historical ... Thriller
Looks like there are 16 comic genres. We’ll explore them more later on.
Update
unique(webtoon$Update)
## [1] "UP EVERY TUESDAY" "UP EVERY WEDNESDAY"
## [3] "UP EVERY SATURDAY" "UP EVERY THURSDAY"
## [5] "UP EVERY SUNDAY" "UP EVERY MONDAY"
## [7] "UP EVERY FRIDAY" "UP EVERY TUE, FRI"
## [9] "UP EVERY MON, THU" "UP EVERY TUE, SAT"
## [11] "UP EVERY WED, THU, FRI, SAT, SUN" "COMPLETED"
## [13] "UP EVERY WED, SUN" "UP EVERY THU, SAT"
## [15] "UP EVERY WED, SAT" "UP EVERY MON, WED, FRI"
## [17] "UP EVERY MON, FRI" "UP EVERY THU, SUN"
## [19] "UP EVERY TUE, SUN" "UP EVERY MON, WED"
## [21] "UP EVERY TUE, THU, SAT" "UP EVERY MON, TUE, WED, THU, SUN"
Based on the update schedule, some webtoons are updated once a week, some more than once a week, and some are completed. Let’s find out exactly how many times each comic is updated in a week.
webtoon <- webtoon %>% mutate(Times_per_Week = ifelse(Update == "COMPLETED", Update, str_count(Update, ",") + 1))
webtoon %>% count(Times_per_Week) %>% mutate(Percentage = round(n/sum(n)*100, 2)) %>% paged_table()
Almost half of the comics are updated once a week, and there are 3 comics that are updated 5 times a week. Their readers must be happy, right?
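One side effect of the ifelse() call above is worth knowing: because one branch returns the Update string, the counts are coerced to character, which is why as.numeric() is needed whenever Times_per_Week is sorted. Here is a minimal sketch on three sample Update values. The alternative shown is my assumption, not the approach used in this analysis: it keeps the column numeric by marking completed comics as NA.

```r
library(stringr)

update <- c("UP EVERY TUESDAY", "UP EVERY MON, WED, FRI", "COMPLETED")

# Mixed branches: the numeric counts are coerced to character
mixed <- ifelse(update == "COMPLETED", update, str_count(update, ",") + 1)
class(mixed)

# Numeric alternative: NA marks completed series
times_per_week <- ifelse(update == "COMPLETED", NA_real_, str_count(update, ",") + 1)
times_per_week
```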
To see the number and titles of comics that are updated on each day of the week, we can create one new variable per day of the week, which is TRUE for comics that are updated on that day and FALSE otherwise.
webtoon$Completed <- ifelse(webtoon$Update == "COMPLETED", T, F)
webtoon$Monday <- ifelse(str_detect(webtoon$Update, "MONDAY|MON"), T, F)
webtoon$Tuesday <- ifelse(str_detect(webtoon$Update, "TUESDAY|TUE"), T, F)
webtoon$Wednesday <- ifelse(str_detect(webtoon$Update, "WEDNESDAY|WED"), T, F)
webtoon$Thursday <- ifelse(str_detect(webtoon$Update, "THURSDAY|THU"), T, F)
webtoon$Friday <- ifelse(str_detect(webtoon$Update, "FRIDAY|FRI"), T, F)
webtoon$Saturday <- ifelse(str_detect(webtoon$Update, "SATURDAY|SAT"), T, F)
webtoon$Sunday <- ifelse(str_detect(webtoon$Update, "SUNDAY|SUN"), T, F)
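Since MON already matches MONDAY, the full day names in the patterns above are redundant, and the seven day assignments can be generated in a loop. A sketch in base R on three sample Update values (the variable names are illustrative):

```r
update <- c("UP EVERY TUESDAY", "UP EVERY MON, THU", "COMPLETED")

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
prefixes <- toupper(substr(days, 1, 3))   # "MON", "TUE", ...

# One logical column per day: TRUE when the three-letter prefix occurs in Update
flags <- sapply(prefixes, function(p) grepl(p, update, fixed = TRUE))
colnames(flags) <- days

flags
colSums(flags)   # number of comics updated on each day
```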
Writers
length(unique(webtoon$Writer))
## [1] 490
There are 490 writers but 569 comic titles, which means some writers have published more than one comic. We’ll explore this later.
We won’t need the id column anymore, so it’s better to remove it.
webtoon <- webtoon %>% select(-1)
Data Exploration
Now, we’ll use the cleaned dataset to answer my questions about the Webtoon world.
- Which comic has the highest number of likes and subscribers?
webtoon %>% arrange(desc(Likes)) %>% head() %>% paged_table()
webtoon %>% arrange(desc(Subscribers)) %>% head() %>% paged_table()
My Giant Nerd Boyfriend by fishball has the highest number of likes, 50.6M. Meanwhile, True Beauty has the highest number of subscribers and ranks fifth in number of likes.
- Which comic has the highest rating?
webtoon %>% arrange(desc(Rating)) %>% head() %>% paged_table()
Your Letter by Hyeon A Cho and Eleceed by Jeho Son / ZHENA both have the highest ratings, 9.93.
- Is there any writer who has published more than one comic?
webtoon %>% count(Writer, sort = T) %>% filter(n > 1) %>% paged_table()
There are 60 writers who have published more than one comic. Various Artists, the label used when writers collaborate on a comic, has published 10 comics. What are they?
webtoon %>% filter(Writer == "Various Artists") %>% paged_table()
Next, I want to look more closely at the individual writers with the second-highest number of published comics.
webtoon %>% filter(Writer %in% c("Dean Haspiel", "Donggeon Lee", "Ilkwon Ha")) %>%
  arrange(Writer) %>% paged_table()
Dean Haspiel, Donggeon Lee, and Ilkwon Ha have each published 4 comics. Dean Haspiel specializes in superhero comics, while the other two have published comics in various genres.
- How many comic genres have been published, and which genres are published the most?
genre <- webtoon %>% count(Genre, sort = T)
plot_genre <- ggplot(genre, aes(reorder(Genre, n), n, text = paste("Number of Comics:", n))) +
  geom_segment(aes(x = reorder(Genre, n), xend = reorder(Genre, n), y = 0, yend = n), size = 1.7, color = "gray90", alpha = .8) +
  geom_point(color = "limegreen", size = 5) +
  coord_flip() +
  labs(title = "Webtoon Genres", x = "", y = "") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
ggplotly(plot_genre, tooltip = "text")
There are 16 genres published so far, and Fantasy, Romance, and Drama are the three most published comic genres on Webtoon.
- Which genre is most favored by readers?
favorite <- webtoon %>% group_by(Genre) %>%
  summarize(Likes = median(Likes), Subscribers = median(Subscribers), Rating = round(median(Rating), 2))
likes <- favorite %>%
  arrange(desc(Likes)) %>%
  mutate(Likes = paste0(Genre, " (", prettyNum(Likes, big.mark = ",", scientific = F), " Likes", ")")) %>%
  select(2) %>%
  head(3)
rating <- favorite %>%
  arrange(desc(Rating)) %>%
  mutate(Rating = paste0(Genre, " (", Rating, ")")) %>%
  select(4) %>%
  head(3)
subs <- favorite %>%
  arrange(desc(Subscribers)) %>%
  mutate(Subscribers = paste0(Genre, " (", prettyNum(Subscribers, big.mark = ",", scientific = F), " Subscribers", ")")) %>%
  select(3) %>%
  head(3)
top3 <- data.frame(Rank = 1:3) %>% cbind(likes, subs, rating)
names(top3)[2:4] <- c("Likes", "Subscribers", "Rating")
top3 %>% paged_table()
Since the distributions of likes, subscribers, and ratings are skewed, the median is a better measure of central tendency than the mean. As shown in the table above, Romance is the favorite genre in terms of number of likes, number of subscribers, and rating.
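To see why the median is preferred for skewed data, consider a small made-up example (the like counts below are hypothetical, not taken from the dataset) in which one very popular comic dominates a genre:

```r
# Hypothetical like counts for five comics in one genre
likes <- c(120000, 250000, 400000, 900000, 50600000)

mean(likes)    # 10454000, pulled up by the single outlier
median(likes)  # 400000, closer to a typical comic
```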
- Which comics are updated most often?
webtoon %>% filter(Completed == F) %>% arrange(desc(as.numeric(Times_per_Week))) %>%
  select(c(Name, Writer, Times_per_Week)) %>% paged_table()
HEART Anthology by Marvin.W / caw-chan, Denma by Gasfard, and BRAIN Anthology by Various Artists are updated most often, i.e. 5 times a week.
- On which days do writers update their comics the most?
webtoon_day_updated <- webtoon %>% filter(Completed == F) %>% select(Monday:Sunday) %>%
  colSums() %>%
  data.frame()
webtoon_day_updated <- rownames_to_column(webtoon_day_updated)
names(webtoon_day_updated) <- c("Day of Week", "Number of Comics")
webtoon_day_updated <- webtoon_day_updated %>%
  mutate(`Day of Week` = factor(`Day of Week`,
                                levels = c("Monday", "Tuesday", "Wednesday",
                                           "Thursday", "Friday", "Saturday", "Sunday")))
plot_day <- webtoon_day_updated %>% ggplot(aes(`Day of Week`, `Number of Comics`)) +
  geom_segment(aes(x = `Day of Week`, xend = `Day of Week`, y = 0, yend = `Number of Comics`), color = "gray90", size = 2) +
  geom_point(color = "limegreen", size = 5) +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  labs(x = "",
       y = "",
       title = "Number of Comics Updated Each Day") +
  theme(plot.title = element_text(hjust = 0.5))
ggplotly(plot_day)
More comics are updated on Tuesday, Friday, and Saturday than on other days, although the difference is not large. Sunday is the day with the fewest comic updates.
Conclusions
Based on the analysis above, it can be concluded that:
- To date, My Giant Nerd Boyfriend by fishball has the highest number of likes, 50.6M. Meanwhile, True Beauty by Yaongyi has the highest number of subscribers.
- In terms of rating, Your Letter by Hyeon A Cho and Eleceed by Jeho Son / ZHENA share the highest rating, 9.93.
- There are 60 writers who have published more than one comic.
- There are 16 comic genres that have been published, and Fantasy, Romance, and Drama are the three most published comic genres on Webtoon.
- Romance is readers’ favorite comic genre.
- Almost half of the comics are completed. Those that are still ongoing are updated between 1 and 5 times a week, with the majority updated once a week.
- More comics are updated on Tuesdays, Fridays, and Saturdays than on other days.