A data science project about Kpop idol groups, birthdays, and loads of regex.
As some of you might know, I’ve been a fan of modern Korean pop (Kpop for short) for a really long while now. I went on a student exchange in 2019 and had a great time immersing myself in the culture there! Towards the end of my exchange, I decided to follow one interesting trend.
In Korea, fans take the birthdays of pop stars (or idols) really seriously. Groups of dedicated fans often put time and effort into organizing elaborate birthday events, usually held at cafes in Seoul (or even other major cities).
Each event is unique, and I started visiting cafes whenever it was the birthday of an artist I was interested in (and when you’re a fan of many groups, that’s actually really frequent). Usually, (fan-made) merchandise is given out as well!
After coming back, I became curious about the birthdays of idol group members and the possibility of covering as many events as possible in a specific date range. (As I don’t live in Korea, it was interesting to me to see whose events would be up when I visit.) In particular, I started to ponder two questions: what is the shortest period of time that covers every birthday in a group, and whose birthdays fall within a given date range?
From there, I quickly set out on a data science project to answer my two questions. However… what I thought was a quick and easy web scraping and data wrangling task quickly turned into an exercise in regex, dealing with multiple data types, and other problems surrounding data cleaning. Nonetheless, it turned out to be a good experience in data cleaning (and possibly a sneak peek at my future work in data science), so I documented my entire process as a learning experience worth sharing.
For this project, the usual dplyr and tidyverse libraries were used for data wrangling. I also used rvest for website scraping, and lubridate for its processing functions as the project would involve datetime data.
Fortunately, the data I need (at least for the popular artists) is well-documented by an ever-increasing community of dedicated supporters.
The website KProfiles, which compiles crowd-sourced information on a great variety of Korean pop artists, also conveniently provides a rather comprehensive list of artists’ names, groups, and birthdays. Also, the website is in a simple HTML format, which we can easily scrape using the rvest library.
library(tidyverse) # dplyr, stringr, ggplot2
library(rvest)     # HTML scraping
library(lubridate) # Date helpers
url <- "https://kprofiles.com/kpop-birthdays/"
text_raw <- read_html(url) %>% html_nodes('p') %>% html_text()
# Must unlist after every split
text <-
  text_raw[2:371] %>% # Remove other body text (indices are specific to the page as scraped)
  strsplit('\n') %>% # Split on line breaks
  unlist() %>%
  strsplit('(?<=\\d{4})(?=[[:alpha:]])', perl=TRUE) %>% # Split e.g. '1993NextPersonName'
  unlist() %>%
  strsplit('(?<=\\d{4})\\s+(?=[[:alpha:]])', perl=TRUE) %>% # Split e.g. '1993 NextPersonName'
  unlist() %>%
  strsplit('(?<=\\))(?=[[:alpha:]])', perl=TRUE) %>% # Split e.g. ')NextPersonName'
  unlist() %>%
  str_squish() # Remove unnecessary spaces
The raw HTML text extracted is messy, but there are some patterns which we can use to split the entire mess into separate “entries”, each ideally representing one artist.
[1] "Jisoo (BlackPink) – Jan 3, 1995"
Most entries follow the format Name (Group) - Birthday. This makes it easy to separate each text string into its respective fields, from which we can create a dataframe.
idol_group_birthday <- function(text) {
  i_g_b <-
    text %>%
    strsplit('\\s[–-]\\s') %>% # Split to 'Name (Group)' 'Birthday'
    unlist() %>%
    strsplit('\\s\\(', perl=TRUE) %>% # Split to 'Name' 'Group)' 'Birthday'
    unlist() %>%
    gsub('\\)$', '', .) %>% # Remove extra close bracket ')'
    str_squish()
  return(i_g_b)
}
idol_birthdays <- list(name = rep(NA, length(text)),
                       group = rep(NA, length(text)),
                       birthday = rep(NA, length(text)))
for (i in 1:length(text)) {
  i_g_b <- idol_group_birthday(text[i])
  idol_birthdays$name[i] <- i_g_b[1]
  idol_birthdays$group[i] <- i_g_b[2]
  idol_birthdays$birthday[i] <- i_g_b[3]
}
# Convert list to dataframe
idol_birthdays <- idol_birthdays %>% as.data.frame()
rm(i, i_g_b)
In our dataframe, each row represents one artist.
| name | group | birthday |
|---|---|---|
| Lee Sungmin | Super Junior | Jan 1, 1986 |
This method, however, results in some troublesome entries, for instance this row which probably originated from text used as section headers.
| name | group | birthday |
|---|---|---|
| January | Capricorn; Aquarius | NA |
We have to be careful when removing these unwanted rows. For instance, there are artists named May and also a group called April. To remove only the correct rows, we have to identify the rows whose name field is simply a month name (e.g. January) and have an empty birthday field. This avoids unintentionally grabbing the artists/groups with the same names as months. To help us, we can use the built-in datasets month.name (full names of months) and month.abb (abbreviated names of months).
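The removal rule described above could look roughly like the following base R sketch. The first two rows come from the tables in this post; the `May` row (its group and date included) is a hypothetical placeholder standing in for an artist whose name is also a month, and `is_header` is just an illustrative variable name.

```r
# Toy rows: 'January' and 'Lee Sungmin' are taken from the tables above;
# the 'May' row is a hypothetical placeholder, not real data.
idol_birthdays <- data.frame(
  name     = c('January', 'Lee Sungmin', 'May'),
  group    = c('Capricorn; Aquarius', 'Super Junior', 'ExampleGroup'),
  birthday = c(NA, 'Jan 1, 1986', 'Jan 2, 1990'),
  stringsAsFactors = FALSE
)

# A header row has a month as its name AND an empty birthday field.
is_header <- idol_birthdays$name %in% c(month.name, month.abb) &
  is.na(idol_birthdays$birthday)

# Keep everything that is not a header row.
idol_birthdays <- idol_birthdays[!is_header, ]
print(idol_birthdays$name)
```

Requiring both conditions is what keeps the artist named May in the data while the January header row is dropped.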
For some entries, the group field is still mixed up and the birthday field is still missing.
| name | group | birthday |
|---|---|---|
| Chungha | Soloist, I.O.I)- Feb 9, 1996 | NA |
In particular, their text strings have commas at various positions from which we can then identify the group and birthday portions. From there, a function can be written to tidy up these fields.
Some artists are also identified as not being part of any group (i.e. solo artists), so the function also helps in standardizing their group field. (We will need this for later.)
# Separate group name and birthday
# `months` is assumed to be a character vector of month names defined earlier
# (e.g. built from month.name and month.abb)
group_birthday <- function(text) {
  pos <- -1
  for (month in months) {
    pos <- regexpr(month, text)[1] # Position of the first match, or -1 if none
    if (pos > 0) {
      break
    }
  }
  if (pos == 1) {
    g_b <- list(group = 'Soloist',
                birthday = substr(text, pos, nchar(text)))
  } else if (pos > 1) {
    g_b <- list(group = substr(text, 1, pos-2),
                birthday = substr(text, pos, nchar(text)))
  } else {
    # No month found: keep the group text and leave the birthday missing
    g_b <- list(group = text, birthday = NA)
  }
  return(g_b)
}
# Apply function to clean up
for (id in which(!complete.cases(idol_birthdays))) {
  g_b <- group_birthday(idol_birthdays$group[id])
  idol_birthdays$group[id] <- g_b$group
  idol_birthdays$birthday[id] <- g_b$birthday
}
rm(id, g_b)
This manages to fill up all fields for all rows! We can move on to tidying up the text data inside our fields.
In this section we will see extensive use of the gsub (text substitution) function and regex (regular expressions) - to “catch” specific instances of text and make the appropriate substitutions or removals accordingly.
The first step is to do a general tidying-up of the group names. Some group names end up with stray punctuation, which we have to remove. This, however, also affects groups whose names legitimately contain punctuation (namely f(x) and (G)I-DLE), so we have to return them to their correct stylized spelling.
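As a rough sketch of this step in base R: the exact cleaning rule is not shown in this post, so the pattern below (strip punctuation at the start or end of a name) is an assumption, and `'Super Junior-'` is a made-up example of a name left with trailing punctuation.

```r
# Illustrative group strings; 'Super Junior-' is a made-up damaged name.
groups <- c('Super Junior-', 'f(x)', '(G)I-DLE')

# Assumed cleaning rule: strip punctuation at the start or end of each name.
cleaned <- gsub('^[[:punct:]]+|[[:punct:]]+$', '', groups)
# At this point 'f(x)' has lost its ')' and '(G)I-DLE' has lost its '('.

# Restore the stylized names by matching the damaged forms exactly.
cleaned <- gsub('^f\\(x$', 'f(x)', cleaned)
cleaned <- gsub('^G\\)I-DLE$', '(G)I-DLE', cleaned)
print(cleaned)
```

Anchoring the repair patterns with `^` and `$` keeps them from touching any other group name that merely contains the same characters.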
There is also a particular group which goes interchangeably by TVXQ and DBSK. Both names are mentioned, so to standardize, the latter is removed.
Some artists are also wrongly classified as soloists by the (above) group_birthday function. They have to be manually corrected, to prevent them from disappearing later. (We will sadly be removing soloists afterwards…)
idol_birthdays[171,]$group <- 'DIA'
idol_birthdays[1565,]$name <- 'Seoyul'
idol_birthdays[1565,]$group <- 'Berry Good'
The first challenge comes from artists who are/were part of multiple groups, or perform multiple roles e.g. doubling up as actors/actresses. This means our database would identify, say, Yeonjung as belonging to a different group from her Cosmic Girls friends, as their entries in the group field would be different.
| name | group | birthday |
|---|---|---|
| Yeonjung | Cosmic Girls, I.O.I | Aug 3, 1999 |
| Kahi | Former After School/Actress | Dec 25, 1980 |
Fortunately, their group fields all follow the same pattern - a comma or slash separator!
multiple_groups <- function(df) {
  df <- filter(df, grepl(',', group) | grepl('/', group))
  df_output <- data.frame()
  for (i in seq_len(nrow(df))) { # seq_len() skips the loop when no rows match
    groups <- df$group[i] %>% strsplit('/') %>% unlist() %>% strsplit(',') %>% unlist()
    idol_df <- data.frame(name = rep(df$name[i], length(groups)),
                          group = groups,
                          birthday = rep(df$birthday[i], length(groups)))
    df_output <- rbind(df_output, idol_df)
  }
  return(df_output)
}
# Perform function for those rows, append to original dataframe, remove original rows
# (a logical mask is used because -which() would drop every row if nothing matched)
has_multiple <- grepl(',', idol_birthdays$group) | grepl('/', idol_birthdays$group)
idol_birthdays <-
  rbind(idol_birthdays[!has_multiple, ],
        multiple_groups(idol_birthdays))
We can then address this by separating each artist with multiple groups into separate entries, each representing one group that he/she is/was in.
| name | group | birthday |
|---|---|---|
| Yeonjung | Cosmic Girls | Aug 3, 1999 |
| Yeonjung | I.O.I | Aug 3, 1999 |
This also follows the first normal form (1NF) in database normalization, though it results in duplicate data and hence would require further normalization. Thankfully there are too few such instances in our data to slow our processing down in a noticeable way, so I chose to leave the dataset at that.
In the resulting dataset, we have a lot of artists who aren’t actually part of any particular group. Some of them also had solo projects whilst/after being part of a group. This makes things tricky. Do we count the purely solo artists? Do we count those involved in both solo and group activities?
Well, since the main objective of the project revolves around groups, I decided to exclude all soloists - that is, all rows where the group field indicates that the artist is a soloist. This effectively removes all purely solo artists, and also disregards the solo activities of those who are double-hatting. (Contextually, it can also get quite debatable on what makes an “idol” singer and what doesn’t, so excluding all of them would make things easier.) There are also Actor and Actress entries turning up in the group field, presumably for those idol-turned-actors and actresses, which are also not relevant to this project and hence removed too.
# A logical mask is used because -which() would drop every row if nothing matched
idol_birthdays <-
  idol_birthdays[!(idol_birthdays$group %in% c('Solo',
                                               'Soloist',
                                               'Solo Singer',
                                               'Solist/',
                                               'Actor',
                                               'Actress')), ]
The next challenge comes in the form of ex-members of groups - people come and go, group rosters don’t stay the same forever - so who do we count as actual “members” at this point of time?
| name | group | birthday |
|---|---|---|
| Hyuna | Former Wonder Girls | June 6, 1992 |
| Hyuna | 4minute | June 6, 1992 |
To address this, I decided to take groups at their current iteration - just for the sake of consistency. To achieve this, we can remove rows where things like “Former member of Group” or “Ex-member of Group” show up in the group field.
This gets really tricky when it comes to disbanded groups. For the purposes of this dataset, it is assumed that for disbanded groups, the data comprises their last iteration at the point of disbandment.
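A minimal base R sketch of dropping former-member rows, using the two Hyuna rows shown above. The pattern (a group field starting with "Former" or "Ex") is an assumption about how such entries are worded, and `is_former` is an illustrative name.

```r
# The two rows from the table above.
idol_birthdays <- data.frame(
  name     = c('Hyuna', 'Hyuna'),
  group    = c('Former Wonder Girls', '4minute'),
  birthday = c('June 6, 1992', 'June 6, 1992'),
  stringsAsFactors = FALSE
)

# Flag group fields that start with 'Former' or 'Ex' as a whole word,
# so that a group whose name merely begins with 'Ex' is not caught.
is_former <- grepl('^(former|ex)\\b', idol_birthdays$group, ignore.case = TRUE)

# Keep only the current memberships.
current <- idol_birthdays[!is_former, ]
print(current$group)
```

The word boundary `\\b` is the design choice that matters here: without it, any group name that happens to start with those letters would be removed as well.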
The following is just a series of text substitutions to clean up specific group names once and for all. Most of them are just to fix inconsistent naming (many groups’ names are stylized) or typos.
idol_birthdays$group <-
idol_birthdays$group %>%
gsub('also known as PUNCH', '1PUNCH', .) %>%
gsub('A-Peace ‘Jade‘', 'A-Peace ‘Lapis’', .) %>%
gsub('A-Peace ‘Lapis$', 'A-Peace ‘Lapis’', .) %>%
gsub('A-Peace ‘Lapis‘', 'A-Peace ‘Lapis’', .) %>%
gsub('B2st', 'Beast', .) %>%
gsub('Bigflo', 'BIGFLO', .) %>%
gsub('BigFlo', 'BIGFLO', .) %>%
gsub('Boyfriendist', 'Boyfriend', .) %>%
gsub('Dalshabet', 'Dal Shabet', .) %>%
gsub('F.Cuz', 'F.CUZ', .) %>%
gsub('FT Island', 'FT. Island', .) %>%
gsub('Kara', 'KARA', .) %>%
gsub('miss A', 'Miss A', .) %>%
gsub('Moxine', 'MOXIE', .) %>%
gsub('MOXINE', 'MOXIE', .) %>%
gsub('NU-EST', 'NU’EST', .) %>%
gsub('RaNia', 'Rania', .) %>%
gsub('Satruday', 'Saturday', .) %>%
gsub('SKARF', 'SKarf', .) %>%
gsub('Touch', 'TOUCH', .) %>%
gsub('Varsity', 'VARSITY', .) %>%
str_squish()
Upon browsing through the final dataframe, there are still some mistakes, mainly from inconsistencies in the original website text which caused the previous functions to place things wrongly. There are only a couple of wrong entries at this point, so manual correction (inserting the required info obtained from a quick Google search) would do.
Another perhaps more glaring mistake would be the number of groups with only one member. (After all, how can a group have only one person?) These are usually groups with incomplete member information, and so (instead of manual Google searches to add in the required data) I just decided to drop these groups from the dataset.
This can be done by a group_by and summarize (to get the sizes of each group) followed by joining the result with the original dataset and then filtering out those groups with group size 1.
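The same idea can be sketched in base R, where `ave()` stands in for the group_by, summarize, and join steps by attaching each group's size directly to every row. The rows below are a small subset taken from tables in this post, so MOXIE appears with a single member only because of the subset.

```r
# Toy subset of rows from the tables in this post.
members <- data.frame(
  name  = c('Carla', 'Jihyo', 'Mina'),
  group = c('MOXIE', 'Twice', 'Twice'),
  stringsAsFactors = FALSE
)

# For every row, count how many rows share its group.
group_size <- ave(seq_along(members$group), members$group, FUN = length)

# Drop groups that have only one listed member.
kept <- members[group_size > 1, ]
print(kept)
```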
We are done with the group field! Now for the birthday field, which is hopefully less painful to deal with, despite having to deal with datetime objects…
The first step for the birthday field, whose entries are currently text strings, is to convert everything into the date data type.
Unfortunately, not all of our text strings are consistent (e.g. there might have been a 13 Sep, 1998 and a 16th September 1987). Hence we have to address this before we do our conversion.
# Standardize the dates
standard_date <- function(text) {
  b <-
    text %>%
    gsub(',', ' ', .) %>% # Replace commas with spaces
    gsub('(\\d)th\\b', '\\1', .) %>% # Remove ordinal 'th' after a day number (not recognized)
    gsub('Sept\\b', 'Sep', .) %>% # Change 'Sept' (not recognized) to 'Sep', leaving 'September' intact
    str_squish() %>%
    as.Date(format = '%b %d %Y') # Expects month-day-year order
  return(b)
}
idol_birthdays <-
  mutate(idol_birthdays, birthday = standard_date(birthday))
Now that the birthday field is standardized, we can sort the dataframe!
However, simply using arrange will sort by year-month-day. Sorting the data by month-day (as in the original website) requires a slightly more complicated method:
1. Create month and day columns - extract the birth month and day from each birthday using the mutate function.
2. Sort the dataframe by the month and day columns.
3. Remove month and day (as we don’t need them anymore).
The lubridate library is quite useful in extracting month and day from an entry with date datatype.
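The month-day sort can be sketched in base R as follows, with `format()` standing in for lubridate's `month()` and `day()`. The rows are taken from tables in this post; breaking ties by name is an assumption based on how the sorted table appears.

```r
# Rows taken from tables in this post, deliberately out of order.
idols <- data.frame(
  name     = c('Ferlyn', 'Lee Sungmin', 'Nayeon', 'Carla'),
  birthday = as.Date(c('1992-01-02', '1986-01-01', '1995-09-22', '1990-01-01')),
  stringsAsFactors = FALSE
)

# Extract month and day as integers so they sort numerically.
month <- as.integer(format(idols$birthday, '%m'))
day   <- as.integer(format(idols$birthday, '%d'))

# Sort by month, then day, then name; the helper vectors are never
# added to the dataframe, so there is nothing to remove afterwards.
idols <- idols[order(month, day, idols$name), ]
print(idols$name)
```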
| | name | group | birthday |
|---|---|---|---|
| 1 | Carla | MOXIE | 1990-01-01 |
| 2 | Jean Paul | BTL | 1991-01-01 |
| 3 | Kim Seunghwan | A-Peace ‘Jade’ | 1994-01-01 |
| 4 | Kun | NCT U | 1996-01-01 |
| 5 | Lee Sungmin | Super Junior | 1986-01-01 |
| 6 | Mimi | Gugudan | 1993-01-01 |
| 7 | Yuri | O21 | 1996-01-01 |
| 8 | Ferlyn | SKarf | 1992-01-02 |
| 9 | Jinhong | 24K | 1998-01-02 |
| 10 | Lee Jeongmin | Boyfriend | 1994-01-02 |
And we are finally done! All the name, group, and birthday fields are filled and consistent. (I did a manual double-check, but only for the groups I personally know. Hopefully this did not end up deleting any entries that should have remained!)
Our final dataset has 1651 rows, representing 1612 unique artists in 282 groups.
Now that all our data is cleaned, we can proceed with some visualization before heading back to our project questions. The workflow is quite straightforward, mainly involving some grouping via group_by (if needed), followed by aggregation using summarize.
For starters, let’s investigate the diversity in group sizes. We do this by performing the group_by and summarize method twice (no pun intended) - first to get the number of members in each idol group, then to group everything according to idol group size and count how many groups have each size.
# Count members per group, then count groups per group size
idol_birthdays %>%
group_by(group) %>%
summarize(group_size = n()) %>%
group_by(group_size) %>%
summarize(n = n()) %>%
ggplot(aes(x=group_size, y=n, fill=group_size)) +
geom_bar(stat='identity') +
theme(legend.position = 'none') +
labs(title='Distribution of Group Sizes', x='Group Size', y='No. of Groups')
As previously mentioned, because ex-members may or may not be removed from the dataset, for quite a number of groups, the number may not be what you expect. There may also be incomplete information for certain groups hence resulting in smaller group sizes.
We group the dataset by birth year and count how many people were born in each one.
idol_birthdays %>%
mutate(year = format(birthday, format='%Y')) %>%
group_by(year) %>%
summarize(number = n()) %>%
ggplot(aes(x=year, y=number, fill=year)) +
geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle=90, hjust=1), legend.position = 'none') +
labs(title='No. of Idols by Birth Year', x='Year', y=NULL)
Same as above, but grouping by month instead.
The month field is converted to a factor data type, which treats the entries as categorical variables. This is helpful as it preserves the order of months (otherwise R will sort the x-axis alphabetically).
idol_birthdays %>%
mutate(month = factor(format(birthday, format='%b'), levels = month.abb)) %>%
group_by(month) %>%
summarize(number = n()) %>%
ggplot(aes(x=month, y=number, fill=month)) +
geom_bar(stat='identity') +
theme(legend.position = 'none') +
labs(title='No. of Idols by Birth Month', x='Month', y=NULL)
Our first question is to find out the range, or the shortest period of time to cover all birthdays in a group.
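Before we get to the calculations, we need to consider two things: the birth year is irrelevant here (only the month and day matter), and leap years would complicate the day counts. Both are handled by setting every birthday to the same non-leap year.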
# Since year is irrelevant, standardize all years to a specified non-leap year (1997)
idol_birthdays2 <- idol_birthdays
lubridate::year(idol_birthdays2$birthday) <- 1997
Then, to find the range, we first have to visualize birthdays on a date line. Let’s imagine a fictional three-member group with birthdays on January 1st (A), May 7th (B), and December 12th (C). The “gaps” between consecutive birthdays are 126 days (A to B), 219 days (B to C), and 20 days (C to next year’s A).
To cover all birthdays, we can see there are 3 possibilities: A to C, B to (next year’s) A, and C to (next year’s) B. Notice that each selection covers 2 “gaps” - for example if we do A to C, then we consider the “gaps” of 126 days and 219 days.
We want to cover all birthdays in the shortest amount of time. As such, what we have to do is to make the selection with the smallest “gaps”! This can be done by starting the count for our range after the largest “gap”. For the example above, this means counting from C (avoiding the largest “gap” of 219 days) to (next year’s) B, resulting in a range of 147 days (inclusive of everyone’s birthdays).
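The arithmetic for the fictional three-member group can be checked with a few lines of base R. This is only a worked example of the reasoning above; the variable names are illustrative, and 1997 is used as the common non-leap year.

```r
# Birthdays A, B, C of the fictional group, all placed in a non-leap year.
birthdays <- as.Date(c('1997-01-01', '1997-05-07', '1997-12-12'))

# Gaps between consecutive birthdays, plus the wrap-around gap from C
# back to next year's A.
gaps <- c(as.numeric(diff(birthdays)),
          365 - as.numeric(birthdays[3] - birthdays[1]))
print(gaps) # 126 219 20

# Skip the largest gap; add 1 so the last birthday itself is included.
range_days <- 366 - max(gaps)
print(range_days) # 147
```

The three gaps always sum to 365, which is why skipping the largest one leaves the shortest possible window.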
Applying this logic, we can then write the appropriate function for all groups in our dataset. We can also include in the function a way to tell us whose birthday to start counting from.
# Assumes `birthdays` is sorted in ascending order and shares one non-leap year
birthday_range <- function(birthdays, show_start=FALSE) {
  gaps <- vector()
  for (i in 1:length(birthdays)) {
    if (i == length(birthdays)) {
      # Wrap-around gap from the last birthday to next year's first birthday
      gaps <- c(gaps, 365 + birthdays[1] - birthdays[i])
    } else {
      gaps <- c(gaps, birthdays[i+1] - birthdays[i])
    }
  }
  largest_gap_index <- which.max(gaps)
  if (largest_gap_index == length(birthdays)) {
    start_index <- 1
    last_index <- length(birthdays)
  } else {
    start_index <- largest_gap_index + 1
    last_index <- largest_gap_index
  }
  # 365 - largest gap in days + 1 to include last person's birthday
  range <- 366 - gaps[last_index]
  if (show_start) {
    return(format(birthdays[start_index], '%m/%d'))
  }
  return(range)
}
Let’s see how the above function works for the group Twice.
# Apply function on one group
twice_birthdays <- filter(idol_birthdays2, group == 'Twice')
paste(twice_birthdays$group[1],
'has a range of',
birthday_range(twice_birthdays$birthday),
'and we start counting from',
birthday_range(twice_birthdays$birthday, show_start=TRUE))
[1] "Twice has a range of 266 and we start counting from 09/22"
We obtain a range of 266 days, and start counting from September 22, which is Nayeon’s birthday. This means that to cover all of Twice’s birthdays in the shortest amount of time, we start with Nayeon on September 22 and end with Tzuyu on June 14, which takes 266 days.
| name | birthday |
|---|---|
| Jihyo | 1997-02-01 |
| Mina | 1997-03-24 |
| Chaeyoung | 1999-04-23 |
| Dahyun | 1998-05-28 |
| Tzuyu | 1999-06-14 |
| Nayeon | 1995-09-22 |
| Jungyeon | 1996-11-01 |
| Momo | 1996-11-09 |
| Sana | 1996-12-29 |
Let’s then calculate the range for all groups.
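One way to apply the range calculation per group is to split the birthdays by group and map a function over the pieces. The sketch below uses a compact, vectorized stand-in (`range_days`, a hypothetical name) rather than the function above, and its toy data holds Twice's birthdays from the table plus the fictional trio from earlier, all set to 1997.

```r
# Compact stand-in for the range calculation: 366 minus the largest gap.
range_days <- function(birthdays) {
  birthdays <- sort(birthdays)
  n <- length(birthdays)
  gaps <- c(as.numeric(diff(birthdays)),
            365 - as.numeric(birthdays[n] - birthdays[1]))
  366 - max(gaps)
}

# Toy data: Twice (from the table above) and the fictional trio.
toy <- data.frame(
  group = c(rep('Twice', 9), rep('Example trio', 3)),
  birthday = as.Date(c('1997-02-01', '1997-03-24', '1997-04-23',
                       '1997-05-28', '1997-06-14', '1997-09-22',
                       '1997-11-01', '1997-11-09', '1997-12-29',
                       '1997-01-01', '1997-05-07', '1997-12-12')),
  stringsAsFactors = FALSE
)

# One range per group, sorted from smallest to largest.
ranges <- sort(sapply(split(toy$birthday, toy$group), range_days))
print(ranges)
```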
| group | range |
|---|---|
| Fly To The Sky | 8 |
| 15& | 10 |
| LC9 | 14 |
| Trax | 14 |
| Monogram | 26 |
| Twice | 266 |
| The Boyz | 296 |
| SEVENTEEN | 301 |
| Rania | 302 |
| Cosmic Girls | 317 |
Not surprisingly, the smallest numbers are for duos, for instance Fly To The Sky, whose two members have birthdays 7 days apart (hence a range of 8).
We also see that larger groups tend to have a larger range, such as Twice, whose members’ birthdays are well distributed across the year (from February to December), giving a range of 266.
With what we’ve got, we can indeed write a function to get the birthdays of all idols you’re interested in, given a specified date range!
Since we have a cleaned, full dataset, it’s just a matter of filtering. In particular, such a function will apply the filter function on three variables - the start_date and end_date of visit as well as the groups you are interested in.
Again, the specific year of visit does not matter, so that shall not be required in the input. But things become a bit tricky when the end_date turns out to be before the start_date (that is, arriving one year and leaving in the next). We can deal with this by simply adding 365 days to the end_date when this happens.
# start_date: string in the format 'dd/mm'
# end_date: string in the format 'dd/mm'
# groups: character vector
which_idols <- function(start_date, end_date, groups, df) {
  range <-
    c(start_date, end_date) %>%
    paste0('/1997') %>%
    as.Date(format='%d/%m/%Y')
  if (range[2] < range[1]) {
    # Visit wraps around the new year: push the end date into the next year
    range <- c(range, range[2] + 365)
    celebrations <-
      df %>%
      filter(group %in% groups &
               ((birthday >= range[1] & birthday <= range[3]) | birthday <= range[2]))
  } else {
    celebrations <-
      df %>%
      filter(group %in% groups & birthday >= range[1] & birthday <= range[2])
  }
  # Join back to the original data to recover the actual birth years
  celebrations <-
    celebrations %>%
    arrange(birthday) %>%
    select(name, group) %>%
    left_join(idol_birthdays, by=c('name', 'group'))
  return(celebrations)
}
This assumes that your length of stay is at most 365 days (or a full year), but then if you’re present for the entire year then everyone’s birthday is automatically covered. We covered all bases!
As an example, I was in Korea from 1st September 2019 to 3rd January 2020. As a fan of the groups Twice and GFriend, let’s see whose birthdays I could have celebrated!
| name | group | birthday |
|---|---|---|
| Nayeon | TWICE | 1995-09-22 |
| Yuju | GFRIEND | 1997-10-04 |
| Jungyeon | TWICE | 1996-11-01 |
| Momo | TWICE | 1996-11-09 |
| Sowon | GFRIEND | 1995-12-07 |
| Sana | TWICE | 1996-12-29 |
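The wrap-around logic can be checked in isolation with a small base R sketch. `in_window` is a hypothetical helper, not the function above; the toy rows are the six from this table plus Jihyo and Tzuyu from the earlier Twice table, with every birthday set to 1997.

```r
# Toy rows from the tables in this post, all standardized to 1997.
idols <- data.frame(
  name  = c('Jihyo', 'Tzuyu', 'Nayeon', 'Yuju',
            'Jungyeon', 'Momo', 'Sowon', 'Sana'),
  group = c('TWICE', 'TWICE', 'TWICE', 'GFRIEND',
            'TWICE', 'TWICE', 'GFRIEND', 'TWICE'),
  birthday = as.Date(c('1997-02-01', '1997-06-14', '1997-09-22', '1997-10-04',
                       '1997-11-01', '1997-11-09', '1997-12-07', '1997-12-29')),
  stringsAsFactors = FALSE
)

# TRUE for birthdays inside the visit; dates are 'dd/mm' strings.
in_window <- function(birthday, start_date, end_date) {
  start <- as.Date(paste0(start_date, '/1997'), format = '%d/%m/%Y')
  end   <- as.Date(paste0(end_date, '/1997'), format = '%d/%m/%Y')
  if (end < start) {
    # Visit crosses the new year: after the start OR before the end.
    birthday >= start | birthday <= end
  } else {
    birthday >= start & birthday <= end
  }
}

# 1 September to 3 January, as in the example visit.
visit <- idols[in_window(idols$birthday, '01/09', '03/01'), ]
print(visit$name)
```

Because every birthday sits in the same year, a visit that crosses the new year reduces to an "or" of two one-sided comparisons.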
In this process, you may have noticed that I was seemingly able to anticipate the problems in the data and address them sequentially. This was not actually the case - I spotted problems down the line and then retroactively applied changes. Regardless, understanding the data - specifically the context behind the data - is important! This project would not have been possible had I not had certain contextual knowledge about Korean pop, some idol groups, the data source, and so on. It will not be possible to have all the knowledge regarding your data source, but at least having a rough idea of what your data should look like - based on context or past experience in the subject matter - helps a ton.
In doing this project, some decisions also had to be made on how to deal with certain aspects of the data. (Do I filter this way? Or that way?) There may not be a right or wrong method - so the best way is to go with what brings us closer to our project objectives.
This project started out as a simple web scraping and data wrangling task (or so I thought) - but quickly turned into an exercise in regex, dealing with multiple data types, and other problems surrounding data cleaning. Well, this seems like a sneak peek of what’s to come in my future data science work!