Webtoon

Eldora Nadellia

10/1/2021

Introduction

Dataset explanation

WEBTOON is a digital comics platform launched by Naver Corporation in South Korea in 2004. It started in Korea as Naver Webtoon and went global as LINE Webtoon in July 2014; it can be accessed through its website and mobile app. Webtoon offers hundreds of webcomics in numerous genres.

This project takes a deep dive into a Webtoon dataset obtained from Kaggle. The dataset contains information on more than 500 Webtoon Originals as of September 23, 2021. Let’s see what interesting information we can find in it.

Questions

There are some things I am curious about from this dataset:
- Which comic has the highest number of likes and subscribers?
- Is there any writer who has published more than one comic?
- How many comic genres have been published, and which genres are published the most?
- Which genre is most favored by readers? (In terms of number of likes and subscribers)
- Which comics are updated most often?
- On which days do writers update their comics the most?

Load libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(rmarkdown)

Data Preparation

Read and Inspect Dataset

webtoon <- read_csv("Webtoon.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_double(),
##   Name = col_character(),
##   Writer = col_character(),
##   Likes = col_character(),
##   Genre = col_character(),
##   Rating = col_double(),
##   Subscribers = col_character(),
##   Summary = col_character(),
##   Update = col_character(),
##   `Reading Link` = col_character()
## )

str(webtoon)

## spec_tbl_df[,10] [1,114 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id          : num [1:1114] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr [1:1114] "Let's Play" "True Beauty" "Midnight Poppy Land" "Age Matters" ...
##  $ Writer      : chr [1:1114] "Leeanne M. Krecic (Mongie)" "Yaongyi" "Lilydusk" "Enjelicious" ...
##  $ Likes       : chr [1:1114] "30.6M" "39.9M" "10.4M" "25.9M" ...
##  $ Genre       : chr [1:1114] "Romance" "Romance" "Romance" "Romance" ...
##  $ Rating      : num [1:1114] 9.62 9.6 9.81 9.79 9.85 9.82 9.66 9.87 9.82 9.78 ...
##  $ Subscribers : chr [1:1114] "4.2M" "6.4M" "2.1M" "3.5M" ...
##  $ Summary     : chr [1:1114] "She's young, single and about to achieve her dream of creating incredible videogames. But then life throws her "| __truncated__ "After binge-watching beauty videos online, a shy comic book fan masters the art of makeup and sees her social s"| __truncated__ "After making a grisly discovery in the countryside, a small town book editor's life gets entangled with a young"| __truncated__ "She's a hopeless romantic who's turning 30's  and is not super happy about it. He's a reclusive billionaire who"| __truncated__ ...
##  $ Update      : chr [1:1114] "UP EVERY TUESDAY" "UP EVERY WEDNESDAY" "UP EVERY SATURDAY" "UP EVERY WEDNESDAY" ...
##  $ Reading Link: chr [1:1114] "https://www.webtoons.com/en/romance/letsplay/list?title_no=1218" "https://www.webtoons.com/en/romance/truebeauty/list?title_no=1436" "https://www.webtoons.com/en/romance/midnight-poppy-land/list?title_no=1798" "https://www.webtoons.com/en/romance/age-matters/list?title_no=1364" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   Name = col_character(),
##   ..   Writer = col_character(),
##   ..   Likes = col_character(),
##   ..   Genre = col_character(),
##   ..   Rating = col_double(),
##   ..   Subscribers = col_character(),
##   ..   Summary = col_character(),
##   ..   Update = col_character(),
##   ..   `Reading Link` = col_character()
##   .. )

rmarkdown::paged_table(webtoon)

There are 10 variables in this dataset. Here is the explanation of the variables:
- id - Unique id identifying the comic.
- Name - Full name of the comic.
- Writer - Author of the comic.
- Likes - Total no. of Likes.
- Genre - Genres for comic.
- Rating - An average rating out of 10 for the comic.
- Subscribers - Total no. of subscribers.
- Summary - Summary of the comic.
- Update - Day of the week for an update.
- Reading Link - Link where you can read the comic.

Looks like some variables do not have the correct data type and need to be converted into the correct ones.

Data Wrangling

Missing Values

anyNA(webtoon)

## [1] TRUE

colSums(is.na(webtoon))

##           id         Name       Writer        Likes        Genre       Rating 
##            0            0            2            0            0            0 
##  Subscribers      Summary       Update Reading Link 
##            0            0            0            0

Looks like there are 2 missing values in the Writer column.

webtoon %>% filter(is.na(Writer))

## # A tibble: 2 x 10
##      id Name     Writer Likes Genre  Rating Subscribers Summary          Update 
##   <dbl> <chr>    <chr>  <chr> <chr>   <dbl> <chr>       <chr>            <chr>  
## 1   254 My Room~ <NA>   1M    Roman~   9.62 257.3K      Under normal ci~ UP EVE~
## 2   748 The Nun~ <NA>   1M    Drama    9.57 204.9K      When Manager So~ UP EVE~
## # ... with 1 more variable: Reading Link <chr>

The writers of My Roommate is a Gumiho and The Nuna at Our Office are missing. After checking the original website, I found that the writer of My Roommate is a Gumiho is credited under the name "NA", which was presumably read as a missing value when the file was loaded, and that the writer of The Nuna at Our Office is WasakBasak / JANE.

webtoon[webtoon$id == 254, ]$Writer <- "NA"

webtoon[webtoon$id == 748, ]$Writer <- "WasakBasak / JANE"

Let’s check whether the missing values have been successfully filled in.

anyNA(webtoon)

## [1] FALSE

That’s great!
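The "NA" pen name is a good reminder that a value can look missing without being missing. Below is a minimal sketch, using a made-up three-row CSV rather than the real Webtoon.csv, of how the set of strings treated as missing changes what gets read. It uses base R's read.csv and its na.strings argument so that it runs without extra packages; readr's read_csv has a comparable na argument, but I have not re-run the original import with it.

```r
# Toy CSV text (made up for illustration, not the real Webtoon.csv)
csv_text <- "id,Writer\n1,NA\n2,Yaongyi\n3,WasakBasak / JANE\n"

# Default behaviour: the string "NA" is treated as a missing value
default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.na(default_read$Writer)

# Treat only empty fields as missing, so the literal "NA" survives
literal_read <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                         na.strings = "")
literal_read$Writer
```

With the second call, the writer named "NA" stays a regular string and would not need to be patched by hand afterwards.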

Duplicated Values

Before starting the analysis, let’s see if there are duplicated values.

length(unique(webtoon$Name))

## [1] 569

There are 1,114 rows in this dataset but only 569 distinct comic titles, which means some titles appear more than once. To make sure, let’s see which titles are duplicated.

webtoon[duplicated(webtoon$Name) | duplicated(webtoon$Name, fromLast=TRUE), ] %>% arrange(Name) %>% paged_table()

Let’s remove the duplicated values.

webtoon <- webtoon %>%
            distinct(Name, .keep_all = T)

length(unique(webtoon$Name)) == nrow(webtoon)

## [1] TRUE

Now, we’re good to go.
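One thing worth keeping in mind: removing duplicates by Name keeps the first row for each title and silently drops later rows, even if they differ in other columns. The sketch below shows the difference on a made-up data frame (the titles and values are hypothetical), using base R's duplicated() as a stand-in for distinct().

```r
# Toy data (hypothetical titles and values, not taken from the dataset)
toy <- data.frame(
  Name  = c("Comic A", "Comic A", "Comic B", "Comic B"),
  Likes = c("1M", "1M", "2M", "2.1M"),
  stringsAsFactors = FALSE
)

# Rows that repeat an earlier row in every column
full_dupes <- sum(duplicated(toy))

# Rows that repeat an earlier Name, whatever the other columns say
name_dupes <- sum(duplicated(toy$Name))

# Keep the first row per Name, similar to distinct(Name, .keep_all = TRUE)
deduped <- toy[!duplicated(toy$Name), ]
```

Here only one row is an exact copy, but two rows are dropped when deduplicating by Name, so the "2.1M" value for Comic B is lost. Comparing the two counts on the real data would show whether the duplicated titles are exact copies.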

Data Manipulation

Number of subscribers

In this dataset, the subscribers variable type is character, when it’s supposed to be numeric.

head(webtoon$Subscribers, 10)

##  [1] "4.2M"   "6.4M"   "2.1M"   "3.5M"   "1.5M"   "3M"     "649K"   "537.6K"
##  [9] "1.1M"   "4.3M"

The total number of subscribers ranges from thousands to millions. The Webtoon website, from which this data was scraped, does not show exact subscriber counts, so in this analysis the figures are only as precise as the site's abbreviations: the nearest hundred thousand for comics with millions of subscribers, and roughly the nearest thousand for comics with thousands of subscribers.

webtoon$Multiplier <- ifelse(str_detect(webtoon$Subscribers, "M"), 1000000, 1000)
webtoon$Subscribers <- str_replace(webtoon$Subscribers, "M|K|k", "") %>% as.numeric()
webtoon <- webtoon %>% mutate(Subscribers = Subscribers * Multiplier) %>% select(-Multiplier)
head(webtoon$Subscribers, 10)

##  [1] 4200000 6400000 2100000 3500000 1500000 3000000  649000  537600 1100000
## [10] 4300000

Likes

Like Subscribers, the Likes variable does not have the correct data type either.

head(webtoon$Likes, 20)

##  [1] "30.6M"    "39.9M"    "10.4M"    "25.9M"    "9.9M"     "18.9M"   
##  [7] "2.9M"     "9,29,796" "5.8M"     "29M"      "9.3M"     "10.2M"   
## [13] "1.5M"     "8.3M"     "2,17,959" "10.8M"    "7,90,313" "9.9M"    
## [19] "1.7M"     "3.8M"

Like counts above 1 million are also abbreviated to the nearest hundred thousand. Counts under 1 million are given in full, but with commas in unusual positions (e.g. "9,29,796", which looks like Indian-style digit grouping). Let’s fix these issues.

webtoon$Multiplier <- ifelse(str_detect(webtoon$Likes, ","), 1, 1000000)
webtoon$Likes <- str_replace_all(webtoon$Likes, "M|,", "") %>% as.numeric()
webtoon <- webtoon %>% mutate(Likes = Likes * Multiplier) %>% select(-Multiplier)
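Since Subscribers and Likes need almost the same treatment, the two conversions could be folded into one helper. This is only a sketch under my own assumptions: parse_count is a hypothetical function name, it is written in base R so it runs on its own, and it assumes the only suffixes in the data are "M" and "K" (in either case).

```r
# Hypothetical helper (not part of the original analysis): convert
# abbreviated counts such as "4.2M", "537.6K" or "9,29,796" to numbers.
parse_count <- function(x) {
  x <- toupper(trimws(x))
  # Pick the multiplier from the suffix; plain numbers get 1
  multiplier <- ifelse(grepl("M$", x), 1e6,
                ifelse(grepl("K$", x), 1e3, 1))
  # Drop the suffix and any digit-group commas before converting
  digits <- gsub("[MK,]", "", x)
  as.numeric(digits) * multiplier
}

parsed <- parse_count(c("30.6M", "537.6K", "9,29,796", "3M"))
parsed
```

Because the multiplier is chosen from the suffix rather than from the presence of a comma, the same function should handle both columns, and any value with an unexpected suffix would come back as NA instead of being scaled incorrectly.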

Genre

Genre should be converted into a factor.

webtoon$Genre <- factor(webtoon$Genre)
unique(webtoon$Genre)

##  [1] Romance       Supernatural  Fantasy       Action        Drama        
##  [6] Thriller      Mystery       Historical    Comedy        Sci-fi       
## [11] Slice of life Heartwarming  Superhero     Sports        Informative  
## [16] Horror       
## 16 Levels: Action Comedy Drama Fantasy Heartwarming Historical ... Thriller

Looks like there are 16 comic genres. We’ll explore them more later on.

Update

unique(webtoon$Update)

##  [1] "UP EVERY TUESDAY"                 "UP EVERY WEDNESDAY"              
##  [3] "UP EVERY SATURDAY"                "UP EVERY THURSDAY"               
##  [5] "UP EVERY SUNDAY"                  "UP EVERY MONDAY"                 
##  [7] "UP EVERY FRIDAY"                  "UP EVERY TUE, FRI"               
##  [9] "UP EVERY MON, THU"                "UP EVERY TUE, SAT"               
## [11] "UP EVERY WED, THU, FRI, SAT, SUN" "COMPLETED"                       
## [13] "UP EVERY WED, SUN"                "UP EVERY THU, SAT"               
## [15] "UP EVERY WED, SAT"                "UP EVERY MON, WED, FRI"          
## [17] "UP EVERY MON, FRI"                "UP EVERY THU, SUN"               
## [19] "UP EVERY TUE, SUN"                "UP EVERY MON, WED"               
## [21] "UP EVERY TUE, THU, SAT"           "UP EVERY MON, TUE, WED, THU, SUN"

Based on the update schedule, some webtoons are updated once a week, some more than once a week, and some are completed. Let’s find out exactly how many times each comic is updated per week.

webtoon <- webtoon %>% mutate(Times_per_Week = ifelse(Update == "COMPLETED", Update, str_count(Update, ",") + 1))
webtoon %>% count(Times_per_Week) %>% mutate(Percentage = round(n/sum(n)*100, 2)) %>% paged_table()

Almost half of the comics are updated once a week, and there are 3 comics that are updated 5 times a week. Their readers must be happy, right?
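Note that putting "COMPLETED" and the weekly counts in the same column makes Times_per_Week a character variable. An alternative, sketched below on three sample schedule strings in base R, is to use NA for completed comics so the column stays numeric. This is an assumption about what would be convenient, not what the analysis above does.

```r
# Sample schedule strings in the same format as the Update column
update <- c("UP EVERY TUESDAY", "UP EVERY MON, WED, FRI", "COMPLETED")

# Number of comma-separated days listed in each string
days_listed <- lengths(strsplit(update, ",", fixed = TRUE))

# Completed comics have no schedule, so mark them as NA
times_per_week <- ifelse(update == "COMPLETED", NA_integer_, days_listed)
times_per_week
```

With this version, sorting by update frequency no longer needs as.numeric(), and completed comics can be picked out with is.na().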

To see the number and titles of comics that are updated on each day of the week, we can create one logical variable per day, which is TRUE for comics that are updated on that day and FALSE otherwise.

# Each three-letter abbreviation also matches the full day name (e.g. "MON" matches "MONDAY")
webtoon$Completed <- webtoon$Update == "COMPLETED"
webtoon$Monday <- str_detect(webtoon$Update, "MON")
webtoon$Tuesday <- str_detect(webtoon$Update, "TUE")
webtoon$Wednesday <- str_detect(webtoon$Update, "WED")
webtoon$Thursday <- str_detect(webtoon$Update, "THU")
webtoon$Friday <- str_detect(webtoon$Update, "FRI")
webtoon$Saturday <- str_detect(webtoon$Update, "SAT")
webtoon$Sunday <- str_detect(webtoon$Update, "SUN")
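The seven near-identical lines could also be generated from a lookup of day abbreviations. Here is a small sketch in base R on three sample schedule strings; day_abbr and day_flags are names I made up for the illustration.

```r
# Sample schedule strings in the same format as the Update column
update <- c("UP EVERY TUESDAY", "UP EVERY MON, THU", "COMPLETED")

# Full day name -> abbreviation that appears in the schedule text
day_abbr <- c(Monday = "MON", Tuesday = "TUE", Wednesday = "WED",
              Thursday = "THU", Friday = "FRI", Saturday = "SAT",
              Sunday = "SUN")

# One logical column per day: TRUE if the schedule mentions that day
day_flags <- sapply(day_abbr, function(abbr) grepl(abbr, update, fixed = TRUE))

# Number of sample comics updated on each day
colSums(day_flags)
```

The result is a logical matrix with one column per day, which could be bound onto the data frame with cbind() if the per-day columns are needed later.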

Writers

length(unique(webtoon$Writer))

## [1] 490

There are 490 writers but 569 comic titles, which means some writers have published more than one comic. We’ll explore this later.

And we won’t need the id column anymore, so it’s better to remove it.

webtoon <- webtoon %>% select(-1)

Data Exploration

Now, we’ll use the cleaned dataset to answer my questions about the Webtoon world.

Which comic has the highest number of likes and subscribers?

webtoon %>% arrange(desc(Likes)) %>% head() %>% paged_table()

webtoon %>% arrange(desc(Subscribers)) %>% head() %>% paged_table()

My Giant Nerd Boyfriend by fishball has the highest number of likes, 50.6 M. Meanwhile, True Beauty has the highest number of subscribers and ranks fifth in number of likes.

Which comic has the highest rating?

webtoon %>% arrange(desc(Rating)) %>% head() %>% paged_table()

Your Letter by Hyeon A Cho and Eleceed by Jeho Son / ZHENA both have the highest ratings, 9.93.

Is there any writer who has published more than one comic?

webtoon %>% count(Writer, sort = T) %>% filter(n > 1) %>% paged_table()

There are 60 writers who have published more than one comic. Various Artists, the credit used when several writers collaborate on a comic, tops the list with 10 comics. What are they?

webtoon %>% filter(Writer == "Various Artists") %>% paged_table()

Next, I want to look at the individual writers with the second highest number of published comics.

webtoon %>% filter(Writer %in% c("Dean Haspiel", "Donggeon Lee", "Ilkwon Ha")) %>% 
  arrange(Writer) %>% paged_table()

Dean Haspiel, Donggeon Lee, and Ilkwon Ha have each published 4 comics. Dean Haspiel specializes in superhero comics, while the other two have published comics in various genres.

How many comic genres have been published, and which genres are published the most?

genre <- webtoon %>% count(Genre, sort = T)
plot_genre <- ggplot(genre, aes(reorder(Genre, n), n, text = paste("Number of Comics:", n))) +
  geom_segment(aes(x = reorder(Genre, n), xend = reorder(Genre, n), y = 0, yend = n), size = 1.7, color = "gray90", alpha = .8) +
  geom_point(color = "limegreen", size= 5) +
  coord_flip() + 
  labs(title = "Webtoon Genres", x = "", y = "") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot_genre, tooltip = "text")

There are 16 genres published so far, and Fantasy, Romance, and Drama are the three most published comic genres on Webtoon.

Which genre is most favored by readers?

favorite <- webtoon %>% group_by(Genre) %>% 
  summarize(Likes = median(Likes), Subscribers = median(Subscribers), Rating = round(median(Rating), 2))

likes <- favorite %>%
    arrange(desc(Likes)) %>%
    mutate(Likes = paste0(Genre, " (", prettyNum(Likes, big.mark = ",", scientific = F), " Likes", ")")) %>%
    select(2) %>% 
    head(3)
  
rating <- favorite %>%
    arrange(desc(Rating)) %>%
    mutate(Rating = paste0(Genre, " (", Rating, ")")) %>%
    select(4) %>% 
    head(3)

subs <- favorite %>%
    arrange(desc(Subscribers)) %>%
    mutate(Subscribers = paste0(Genre, " (", prettyNum(Subscribers, big.mark = ",", scientific = F), " Subscribers", ")")) %>%
    select(3) %>% 
    head(3)

top3 <- data.frame(Rank = 1:3) %>% cbind(likes, subs, rating)
names(top3)[2:4] <- c("Likes", "Subscribers", "Rating")
top3 %>% paged_table()

Since the distributions of likes, subscribers, and ratings are skewed, the median is a better measure of central tendency than the mean. As shown in the table above, Romance is the favorite genre in terms of number of likes, number of subscribers, and rating.
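To illustrate why the median is the safer choice here, consider a made-up genre with four modest comics and one blockbuster (the numbers below are hypothetical, not taken from the dataset):

```r
# Hypothetical like counts: four modest comics and one blockbuster
likes <- c(100000, 200000, 300000, 400000, 50000000)

mean_likes   <- mean(likes)    # 10200000, pulled up by the single outlier
median_likes <- median(likes)  # 300000, the middle value

c(mean = mean_likes, median = median_likes)
```

The mean suggests a typical comic in this toy genre has about 10 million likes, which is true of none of them, while the median stays close to what most of the comics actually have.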

Which comics are updated most often?

webtoon %>% filter(Completed == F) %>% arrange(desc(as.numeric(Times_per_Week))) %>% 
  select(c(Name, Writer, Times_per_Week)) %>% paged_table()

HEART Anthology by Marvin.W / caw-chan, Denma by Gasfard, and BRAIN Anthology by Various Artists are updated most often: 5 times a week.

On which days do writers update their comics the most?

webtoon_day_updated <- webtoon %>% filter(Completed == F) %>% select(Monday:Sunday) %>% 
  colSums() %>% 
  data.frame()

webtoon_day_updated <- rownames_to_column(webtoon_day_updated)  
names(webtoon_day_updated) <- c("Day of Week", "Number of Comics")

webtoon_day_updated <- webtoon_day_updated %>% 
                        mutate(`Day of Week` = factor(`Day of Week`, 
                                                      levels = c("Monday", "Tuesday", "Wednesday", 
                                                                 "Thursday", "Friday", "Saturday", "Sunday"))) 

plot_day <- webtoon_day_updated %>% ggplot(aes(`Day of Week`, `Number of Comics`)) +
  geom_segment(aes(x = `Day of Week`, xend = `Day of Week`, y = 0, yend = `Number of Comics`), color = "gray90", size = 2) +
  geom_point(color = "limegreen", size = 5) +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  labs(x = "",
       y = "",
       title = "Number of Comics Updated Each Day") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(plot_day)

More comics are updated on Tuesday, Friday, and Saturday than on other days, although the difference is not large. Sunday is the day with the fewest comic updates.

Conclusions

Based on the analysis above, it can be concluded that:
- To date, My Giant Nerd Boyfriend by fishball has the highest number of likes, 50.6 M. Meanwhile, True Beauty by Yaongyi has the highest number of subscribers.
- In terms of rating, Your Letter by Hyeon A Cho and Eleceed by Jeho Son / ZHENA share the highest rating, 9.93.
- There are writers who have published more than one comic.
- There are 16 comic genres that have been published, and Fantasy, Romance, and Drama are the three most published genres on Webtoon.
- Romance is readers’ favorite comic genre.
- Almost half of the comics are completed. Those that are still ongoing are updated between 1 and 5 times a week, most of them once a week.
- More comics are updated on Tuesdays, Fridays, and Saturdays than on other days.