As some of you might know, I’ve been a fan of modern Korean pop (Kpop for short) for a really long while now. I went on a student exchange in 2019 and had a great time immersing myself in the culture there! Towards the end of my exchange, I decided to follow one interesting trend.

In Korea, fans take the birthdays of pop stars (or idols) really seriously. Groups of dedicated fans often put time and effort into organizing elaborate birthday events, usually held at cafes in Seoul (or even other major cities).

Fans also put up “birthday ads” for their idols in subway stations. I found this one, for YoungK/Brian from DAY6, particularly creative.

Each event is unique, and I started visiting cafes whenever it was the birthday of an artist I was interested in (and when you’re a fan of many groups, that’s actually really frequent). Usually, (fan-made) merchandise is given out as well!

It was surprisingly easy for me to know where and when to go, as fans spread the word through Twitter, and all you need to do is search for the correct keywords or hashtags (though everything’s in Korean, so you need to be familiar with it).

After coming back, I became curious about the birthdays of idol group members and the possibility of covering as many events as possible in a specific date range. (As I don’t live in Korea, it was interesting to me to see whose events would be up when I visit.) In particular, I started to ponder the following questions:

What is the shortest time I can cover all member birthdays for a particular idol group?
How can I get the birthdays of all idols I’m interested in, within a specified date range?

From there, I quickly set out on a data science project to answer my two questions. However… what I thought was a quick and easy web scraping and data wrangling task quickly turned into an exercise in regex, dealing with multiple data types, and other problems surrounding data cleaning. Nonetheless, it turned out to be a good experience in data cleaning (and possibly a sneak peek at my future work in data science), so I documented my entire process, as I felt it was a learning experience worth sharing.

Setup

For this project, the usual dplyr and tidyverse libraries were used for data wrangling. I also used rvest for website scraping, and lubridate for its date-handling functions, as the project involves date data.

library(dplyr)
library(rvest)
library(tidyverse)
library(lubridate)

Importing Data

Fortunately, the data I need (at least for the popular artists) is well-documented by an ever-increasing community of dedicated supporters.

The website KProfiles, which compiles crowd-sourced information on a great variety of Korean pop artists, also conveniently provides a rather comprehensive list of artists’ names, groups, and birthdays. Also, the website is in a simple HTML format, which we can easily scrape using the rvest library.

url <- "https://kprofiles.com/kpop-birthdays/"

text_raw <- read_html(url) %>% html_nodes('p') %>% html_text()

# Must unlist after every split
text <-
  text_raw[2:371] %>% # Remove other body text (not needed)
  strsplit(., '\\\n') %>% # Split on newlines (\n)
  unlist() %>%
  strsplit('(?<=\\d{4})(?=[[:alpha:]])', perl=TRUE) %>% # Split e.g. '1993NextPersonName'
  unlist() %>%
  strsplit('(?<=\\d{4})\\s+(?=[[:alpha:]])', perl=TRUE) %>% # Split e.g. '1993 NextPersonName'
  unlist() %>%
  strsplit('(?<=\\))(?=[[:alpha:]])', perl=TRUE) %>% # Split e.g. ')NextPersonName'
  unlist() %>%
  str_squish() # Remove unnecessary spaces

The raw HTML text extracted is messy, but there are some patterns which we can use to split the entire mess into separate “entries”, each ideally representing one artist.

[1] "Jisoo (BlackPink) – Jan 3, 1995"
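To make the lookaround splits a little more concrete, here is a minimal sketch on made-up strings (Alice, Bob, Carol, Dave and their groups are hypothetical, not taken from the website). It uses only base R and tries out the first two split patterns from the pipeline above.

```r
# Toy input (all names are made up for illustration)
glued  <- "Alice (GroupA) - Jan 3, 1995Bob (GroupB) - Feb 11, 1997"
spaced <- "Carol (GroupC) - Mar 5, 1996 Dave (GroupD) - Apr 7, 1998"

# Split at the zero-width position between a 4-digit year and a letter
split_glued <- unlist(
  strsplit(glued, "(?<=\\d{4})(?=[[:alpha:]])", perl = TRUE)
)

# Same idea, but here the whitespace between year and name is consumed
split_spaced <- unlist(
  strsplit(spaced, "(?<=\\d{4})\\s+(?=[[:alpha:]])", perl = TRUE)
)

print(split_glued)
print(split_spaced)
```

The lookbehind `(?<=\\d{4})` and lookahead `(?=[[:alpha:]])` do not consume any characters themselves, which is why the year and the next name both survive the split intact.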

Data Cleaning

Tidying up the Fields

Most entries generally follow the format Name (Group) - Birthday. This makes it easy to separate each text string into its respective fields, from which we can create a dataframe.

idol_group_birthday <- function(text) {
  i_g_b <-
    text %>%
    strsplit('\\s[–-]\\s') %>% # Split to 'Name (Group)' 'Birthday'
    unlist() %>%
    strsplit('\\s\\(', perl=TRUE) %>% # Split to 'Name' 'Group)' 'Birthday'
    unlist() %>%
    gsub('\\)$', '', .) %>% # Remove extra close bracket ')'
    str_squish()
  
  return(i_g_b)
}

idol_birthdays <- list(name = rep(NA, length(text)),
                       group = rep(NA, length(text)),
                       birthday = rep(NA, length(text)))

for (i in 1:length(text)) {
  i_g_b <- idol_group_birthday(text[i])
  idol_birthdays$name[i] <- i_g_b[1]
  idol_birthdays$group[i] <- i_g_b[2]
  idol_birthdays$birthday[i] <- i_g_b[3]
}

# Convert list to dataframe
idol_birthdays <- idol_birthdays %>% as.data.frame()

rm(i, i_g_b)

In our dataframe, each row represents one artist.

name	group	birthday
Lee Sungmin	Super Junior	Jan 1, 1986
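For comparison, here is a sketch of a different way to get the same three fields, using capture groups with base R's regexec and regmatches instead of repeated splitting. This is not the method used in this post, and the entries below are made up. One difference worth noting: an entry that does not fit the pattern (like a bare "January" header) comes out as a row of NAs rather than a half-filled row.

```r
# Toy entries (hypothetical names); the last one mimics a section header
entries <- c("Alice (GroupA) - Jan 3, 1995",
             "Bob (GroupB) - Feb 11, 1997",
             "January")

# Three capture groups: name, group, birthday
pattern <- "^(.+?) \\((.+)\\) - (.+)$"
matches <- regmatches(entries, regexec(pattern, entries, perl = TRUE))

# Element 1 of each match is the full string, elements 2 to 4 are the groups;
# indexing an empty (non-matching) result yields NAs
fields <- do.call(rbind, lapply(matches, function(m) m[2:4]))

parsed <- data.frame(name = fields[, 1],
                     group = fields[, 2],
                     birthday = fields[, 3],
                     stringsAsFactors = FALSE)
print(parsed)
```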

This method, however, results in some troublesome entries, for instance this row which probably originated from text used as section headers.

name	group	birthday
January	Capricorn; Aquarius	NA

We have to be careful when removing these unwanted rows. For instance, there are artists named May and also a group called April. To remove only the correct rows, we have to identify the rows whose name field is simply a month name (e.g. January) and have an empty birthday field. This avoids unintentionally grabbing the artists/groups with the same names as months. To help us, we can use the built-in datasets month.name (full names of months) and month.abb (abbreviated names of months).

months <- c(month.name, month.abb)

idol_birthdays <-
  idol_birthdays[!(idol_birthdays$name %in% months
                   & is.na(idol_birthdays$birthday)), ]
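Here is a small self-contained check of that filter on a toy dataframe (the groups and birthdays are made up). The point is that an artist called May and an artist called April survive, because they have a birthday, while the January header row is dropped.

```r
# Toy data: one section header plus two artists named after months
toy <- data.frame(name = c("January", "May", "April"),
                  group = c("Capricorn; Aquarius", "GroupA", "GroupB"),
                  birthday = c(NA, "Jan 5, 1990", "Feb 2, 1991"),
                  stringsAsFactors = FALSE)

months <- c(month.name, month.abb)

# A row is a header only if BOTH conditions hold
is_header <- toy$name %in% months & is.na(toy$birthday)
kept <- toy[!is_header, ]
print(kept)
```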

For some entries, the group field is still mixed up and the birthday field is still missing.

name	group	birthday
Chungha	Soloist, I.O.I)- Feb 9, 1996	NA

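Their text strings do not quite follow the usual format (in the example above, the space before the dash is missing), so the earlier split leaves the group and birthday stuck together in one field. Since the birthday always starts with a month name, we can find where the month begins and use that position to separate the group and birthday portions. From there, a function can be written to tidy up these fields.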

Some artists are also identified as not being part of any group (i.e. solo artists), so the function also helps in standardizing their group field. (We will need this for later.)

# Separate group name and birthday
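group_birthday <- function(text) {
  # Fallback if no month name is found: keep the text as the group
  g_b <- list(group = text, birthday = NA)
  pos <- -1
  
  for (month in months) {
    # regexpr gives the position of the first match only (-1 if none),
    # so pos is always a single number
    pos <- regexpr(month, text, fixed = TRUE)[1]
    
    if (!is.na(pos) && pos > 0) {
      break
    }
  }
  
  if (is.na(pos)) {
    return(g_b)
  }
  
  if (pos == 1) {
    # Text starts with the birthday, so no group was given
    g_b <- list(group = 'Soloist',
                birthday = substr(text, pos, nchar(text)))
  } else if (pos > 1) {
    # Everything before the month (minus the space) is the group
    g_b <- list(group = substr(text, 1, pos-2),
                birthday = substr(text, pos, nchar(text)))
  }
  
  return(g_b)
}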

# Apply function to clean up
for (id in which(!complete.cases(idol_birthdays))) {
  g_b <- group_birthday(idol_birthdays$group[id])
  
  idol_birthdays$group[id] <- g_b$group
  idol_birthdays$birthday[id] <- g_b$birthday
}

rm(id, g_b)
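As a worked example of the idea, here is what happens to the Chungha string from earlier. This is a simplified sketch in base R: it takes the earliest month position in the string, whereas the function above stops at the first month in the list that matches, so the two could differ on unusual inputs.

```r
text <- "Soloist, I.O.I)- Feb 9, 1996"
months <- c(month.name, month.abb)

# Position of each month name in the text (-1 where it does not occur)
hits <- sapply(months, function(m) regexpr(m, text, fixed = TRUE)[1])
pos <- min(hits[hits > 0])

group <- substr(text, 1, pos - 2)         # drop the space before the month
birthday <- substr(text, pos, nchar(text))

print(pos)
print(group)
print(birthday)
```

The leftover ")-" at the end of the group is exactly the kind of stray punctuation that still needs to be cleaned up afterwards.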

This manages to fill up all fields for all rows! We can move on to tidying up the text data inside our fields.

Tidying up the Text

In this section we will see extensive use of the gsub (text substitution) function and regex (regular expressions) - to “catch” specific instances of text and make the appropriate substitutions or removals accordingly.

Some Quick Processing

The first step is a general tidying-up of the group names. Some group names end up with stray punctuation, which we have to remove. This, however, also strips brackets from groups whose names legitimately contain them (namely f(x) and (G)I-DLE), so we have to restore their correct stylized spelling afterwards.

There is also a particular group which goes interchangeably by TVXQ and DBSK. Both names are mentioned in the data, so to standardize, the latter is removed.

idol_birthdays$group <-
  idol_birthdays$group %>%
  gsub('\\/DBSK', '', .)  %>%
  gsub('\\)[-;]', '', .) %>%
  gsub('\\–', '', .) %>%
  gsub('\\)', '', .) %>%
  gsub('f\\(x', 'f(x)', .) %>%
  gsub('GI-DLE', '(G)I-DLE', .)
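To see why the order of substitutions matters, here is a small sketch of part of the same chain on three toy values, written step by step in base R. The input strings are my own examples of what the group field might look like at this stage.

```r
groups <- c("f(x)", "TVXQ/DBSK", "Super Junior)-")

cleaned <- gsub("/DBSK", "", groups)         # keep only 'TVXQ'
cleaned <- gsub("\\)[-;]", "", cleaned)      # stray ')-' or ');'
cleaned <- gsub("\\)", "", cleaned)          # any remaining ')', so 'f(x)' becomes 'f(x'
cleaned <- gsub("f\\(x", "f(x)", cleaned)    # restore the stylized name

print(cleaned)
```

The bracket removal has to come before the f(x) fix. Done the other way round, the restored closing bracket would simply be stripped again.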

Some artists are also wrongly classified as soloists by the (above) group_birthday function. They have to be manually corrected, to prevent them from disappearing later. (We will sadly be removing soloists afterwards…)

idol_birthdays[171,]$group <- 'DIA'
idol_birthdays[1565,]$name <- 'Seoyul'
idol_birthdays[1565,]$group <- 'Berry Good'

The Multiple Group Conundrum

The first challenge comes from artists who are/were part of multiple groups, or perform multiple roles e.g. doubling up as actors/actresses. This means our dataset would treat, say, Yeonjung as belonging to a different group from her Cosmic Girls groupmates, since their entries in the group field would be different.

name	group	birthday
Yeonjung	Cosmic Girls, I.O.I	Aug 3, 1999
Kahi	Former After School/Actress	Dec 25, 1980

Fortunately, their group fields all follow the same pattern - a comma or slash separator!

multiple_groups <- function(df) {
  df <- filter(df, grepl(',', group) | grepl('/', group))
  
  df_output <- data.frame()
  
  for (i in 1:nrow(df)) {
    groups <- df$group[i] %>% strsplit('/') %>% unlist() %>% strsplit(',') %>% unlist()
    idol_df <- data.frame(name = rep(df$name[i], length(groups)),
                          group = groups,
                          birthday = rep(df$birthday[i], length(groups)))
    
    df_output <- rbind(df_output, idol_df)
  }
  
  return(df_output)
}

# Perform function for those rows, append to original dataframe, remove original rows
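# A logical index is used (rather than -which(...)) so that nothing is
# dropped by accident if no row happens to match
has_multiple <- grepl('[,/]', idol_birthdays$group)

idol_birthdays <-
  rbind(idol_birthdays[!has_multiple, ],
        multiple_groups(idol_birthdays))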

We can then address this by separating each artist with multiple groups into separate entries, each representing one group that he/she is/was in.

name	group	birthday
Yeonjung	Cosmic Girls	Aug 3, 1999
Yeonjung	I.O.I	Aug 3, 1999

This also brings the group field in line with first normal form (1NF) in database normalization, though it results in duplicate data and would strictly require further normalization. Thankfully there are too few such instances in our data to slow our processing down in a noticeable way, so I chose to leave the dataset at that.
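As an aside, the same "one row per artist per group" expansion can be sketched without a loop, using strsplit together with rep and lengths in base R. This is an alternative to the function above rather than what I actually ran, shown here on the two example rows plus one single-group row.

```r
toy <- data.frame(name = c("Yeonjung", "Kahi", "Jisoo"),
                  group = c("Cosmic Girls, I.O.I",
                            "Former After School/Actress",
                            "BlackPink"),
                  birthday = c("Aug 3, 1999", "Dec 25, 1980", "Jan 3, 1995"),
                  stringsAsFactors = FALSE)

# Split on a comma or slash, swallowing any spaces around the separator
parts <- strsplit(toy$group, "\\s*[,/]\\s*")

# Repeat each name and birthday once per group that the artist has
long <- data.frame(name = rep(toy$name, lengths(parts)),
                   group = unlist(parts),
                   birthday = rep(toy$birthday, lengths(parts)),
                   stringsAsFactors = FALSE)
print(long)
```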

The Soloist Conundrum

In the resulting dataset, we have a lot of artists who aren’t actually part of any particular group. Or, some of them had solo projects whilst/after being part of a group. This makes things tricky. Do we count the purely solo artists? Do we count those involved in both solo and group activities?

Well, since the main objective of the project revolves around groups, I decided to exclude all soloists - that is, all rows where the group field indicates that the artist is a soloist. This effectively removes all purely solo artists, and also disregards the solo activities of those who are double-hatting. (Contextually, it can also get quite debatable on what makes an “idol” singer and what doesn’t, so excluding all of them would make things easier.) There are also Actor and Actress entries turning up in the group field, presumably for those idol-turned-actors and actresses, which are also not relevant to this project and hence removed too.

# %in% with a logical index avoids -which(...), which would drop every
# row if nothing matched
idol_birthdays <-
  idol_birthdays[!(idol_birthdays$group %in%
                     c('Solo', 'Soloist', 'Solo Singer',
                       'Solist/', 'Actor', 'Actress')), ]

The Ex-member Conundrum

The next challenge comes in the form of ex-members of groups - people come and go, group rosters don’t stay the same forever - so who do we count as actual “members” at this point of time?

name	group	birthday
Hyuna	Former Wonder Girls	June 6, 1992
Hyuna	4minute	June 6, 1992

To address this, I decided to take groups at their current iteration - just for the sake of consistency. To achieve this, we can remove rows where things like “Former member of Group” or “Ex-member of Group” show up in the group field.

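idol_birthdays$group <-
  idol_birthdays$group %>%
  gsub('\\b[fF]ormer\\b.*', '', ., perl=TRUE) %>% # \\b: whole words only
  gsub('\\b[eE]x\\b.*', '', ., perl=TRUE) %>% # Names merely containing 'ex' are untouched
  str_squish() # Leftover spaces would stop a row from counting as empty

idol_birthdays <- 
  idol_birthdays[idol_birthdays$group != '', ]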
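One thing to watch with patterns like these is over-matching. The sketch below compares a loose pattern with a word-boundary version on a few toy values; "Apex" is a hypothetical group name I made up to show the difference, and "Ex-member of GroupA" is likewise just an illustration.

```r
x <- c("Former Wonder Girls", "4minute", "Ex-member of GroupA", "Apex")

# Loose: anything from 'former' or 'ex' onwards, wherever it appears
loose <- gsub("[fF]ormer.*", "", x)
loose <- gsub("[eE]x.*", "", loose)

# Strict: 'former' and 'ex' must be whole words (\\b is a word boundary)
strict <- gsub("\\b[fF]ormer\\b.*", "", x, perl = TRUE)
strict <- gsub("\\b[eE]x\\b.*", "", strict, perl = TRUE)

print(loose)
print(strict)
```

With the loose pattern "Apex" is cut down to "Ap", while the word-boundary version leaves it alone and still blanks out the two ex-member entries.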

This gets really tricky when it comes to disbanded groups. For the purposes of this dataset, it is assumed that for disbanded groups, the data comprises their last iteration at the point of disbandment.

Further Manual Cleaning

The following is just a series of text substitutions to clean up specific group names once and for all. Most of them are just to fix inconsistent naming (many groups’ names are stylized) or typos.

idol_birthdays$group <-
  idol_birthdays$group %>%
  gsub('also known as PUNCH', '1PUNCH', .) %>%
  gsub('A-Peace ‘Jade‘', 'A-Peace ‘Jade’', .) %>% # Fix the closing quote only
  gsub('A-Peace ‘Lapis$', 'A-Peace ‘Lapis’', .) %>%
  gsub('A-Peace ‘Lapis‘', 'A-Peace ‘Lapis’', .) %>%
  gsub('B2st', 'Beast', .) %>%
  gsub('Bigflo', 'BIGFLO', .) %>%
  gsub('BigFlo', 'BIGFLO', .) %>%
  gsub('Boyfriendist', 'Boyfriend', .) %>%
  gsub('Dalshabet', 'Dal Shabet', .) %>%
  gsub('F.Cuz', 'F.CUZ', .) %>%
  gsub('FT Island', 'FT. Island', .) %>%
  gsub('Kara', 'KARA', .) %>%
  gsub('miss A', 'Miss A', .) %>%
  gsub('Moxine', 'MOXIE', .) %>%
  gsub('MOXINE', 'MOXIE', .) %>%
  gsub('NU-EST', 'NU’EST', .) %>%
  gsub('RaNia', 'Rania', .) %>%
  gsub('Satruday', 'Saturday', .) %>%
  gsub('SKARF', 'SKarf', .) %>%
  gsub('Touch', 'TOUCH', .) %>%
  gsub('Varsity', 'VARSITY', .) %>%
  str_squish()

Upon browsing through the final dataframe, there are still some mistakes, mainly from inconsistencies in the original website text which caused the previous functions to place things wrongly. There are only a couple of wrong entries at this point, so manual correction (inserting the required info obtained from a quick Google search) would do.

idol_birthdays$birthday[which(idol_birthdays$name == 'Samuel' & idol_birthdays$group == '1PUNCH')] <- 'Jan 17, 2002'
idol_birthdays$group[which(idol_birthdays$name == 'Aoora')] <- 'AA'
idol_birthdays$birthday[which(idol_birthdays$name == 'Aoora')] <- 'Jan 10, 1986'

Another perhaps more glaring mistake would be the number of groups with only one member. (After all, how can a group have only one person?) These are usually groups with incomplete member information, and so (instead of manual Google searches to add in the required data) I just decided to drop these groups from the dataset.

This can be done by a group_by and summarize (to get the sizes of each group) followed by joining the result with the original dataset and then filtering out those groups with group size 1.

# Filter out groups with only one member
idol_birthdays <-
  idol_birthdays %>%
  group_by(group) %>%
  summarize(group_size = n()) %>%
  right_join(idol_birthdays, by='group') %>%
  filter(group_size > 1) %>%
  select(name, group, birthday)
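The same "drop groups of size one" step can also be sketched in base R with ave, which computes a per-group value and returns it aligned with the original rows, so no join is needed. The names and groups below are made up.

```r
toy <- data.frame(name = c("A1", "A2", "B1", "C1", "C2", "C3"),
                  group = c("GroupA", "GroupA", "GroupB",
                            "GroupC", "GroupC", "GroupC"),
                  stringsAsFactors = FALSE)

# For every row, the number of rows sharing its group
group_size <- ave(seq_along(toy$group), toy$group, FUN = length)

kept <- toy[group_size > 1, ]
print(kept)
```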

We are done with the group field! Now for the birthday field, which is hopefully less painful to deal with, despite having to deal with datetime objects…

Tidying up the Dates

The first step for the birthday field, whose entries are currently text strings, is to convert everything into the date data type.

Unfortunately, not all of our text strings are consistent (e.g. there might have been a 13 Sep, 1998 and a 16th September 1987). Hence we have to address this before we do our conversion.

# Standardize the dates
standard_date <- function(text) {
  b <- 
    text %>%
    gsub('\\,', ' ', .) %>% # Replace commas with spaces
    gsub('(\\d)(st|nd|rd|th)', '\\1', .) %>% # Remove ordinal suffixes like '16th' (not recognized)
    gsub('Sept\\b', 'Sep', ., perl=TRUE) %>% # Change 'Sept' (not recognized) to 'Sep', leaving 'September' alone
    str_squish() %>%
    as.Date(format = '%b %d %Y') # Parse e.g. 'Jan 3 1995'
  
  return(b)
}

idol_birthdays <-
  mutate(idol_birthdays, birthday = standard_date(birthday))
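Here is a minimal base R sketch of the same clean-then-parse idea on three strings. The third one (with 'Sept') is made up for illustration. It assumes an English locale, since %b reads month names in the current locale.

```r
raw <- c("Jan 3, 1995", "June 6, 1992", "Sept 22, 1995")

clean <- gsub(",", " ", raw)                          # commas become spaces
clean <- gsub("Sept\\b", "Sep", clean, perl = TRUE)   # 'Sept' only, not 'September'
clean <- gsub("\\s+", " ", trimws(clean))             # collapse repeated spaces

parsed <- as.Date(clean, format = "%b %d %Y")
print(parsed)
```

Any string that still does not fit the format comes back as NA, so a quick `sum(is.na(...))` on the result is a cheap way to spot dates that need another cleaning rule.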

Now that the birthday field is standardized, we can sort the dataframe!

However, simply using arrange will sort by year-month-day. Sorting the data by month-day (as on the original website) requires a slightly more complicated method:

Create separate month and day columns - extract the birth month and day from each birthday using the mutate function.
Sort the resulting dataframe by the newly-created month and day columns.
Select all columns of the sorted dataframe, other than month and day (as we don’t need them anymore).

The lubridate library is quite useful in extracting month and day from an entry with date datatype.

# Sort by date and tidy up row numbers
idol_birthdays <-
  idol_birthdays %>%
  mutate(month = lubridate::month(birthday),
         day = lubridate::day(birthday)) %>%
  arrange(month, day, name) %>%
  select(name, group, birthday)

	name	group	birthday
1	Carla	MOXIE	1990-01-01
2	Jean Paul	BTL	1991-01-01
3	Kim Seunghwan	A-Peace ‘Jade’	1994-01-01
4	Kun	NCT U	1996-01-01
5	Lee Sungmin	Super Junior	1986-01-01
6	Mimi	Gugudan	1993-01-01
7	Yuri	O21	1996-01-01
8	Ferlyn	SKarf	1992-01-02
9	Jinhong	24K	1998-01-02
10	Lee Jeongmin	Boyfriend	1994-01-02

And we are finally done! All the name, group, and birthday fields are filled and consistent. (I did a manual double-check, but only for the groups I personally know. Hopefully this did not end up deleting any entries that should have remained!)

Our final dataset has 1651 rows, representing 1612 unique artists in 282 groups.

Data Visualization

Now that all our data is cleaned, we can proceed with some visualization before heading back to our project questions. The workflow is quite straightforward, mainly involving some grouping via group_by (if needed), followed by aggregation using summarize.

Distribution of Group Sizes

For starters, let’s investigate the diversity in group sizes. We do this by performing the group_by and summarize method twice (no pun intended) - first to get the number of members in each idol group, then to group everything by idol group size and count how many groups there are of each size.

# Count members per group, then count groups per group size
idol_birthdays %>%
  group_by(group) %>%
  summarize(group_size = n()) %>%
  group_by(group_size) %>%
  summarize(n = n()) %>%
  ggplot(aes(x=group_size, y=n, fill=group_size)) +
  geom_bar(stat='identity') +
  theme(legend.position = 'none') +
  labs(title='Distribution of Group Sizes', x='Group Size', y='No. of Groups')

As previously mentioned, ex-members may or may not have been removed from the dataset, so for quite a number of groups the member count may not be what you expect. Information may also be incomplete for certain groups, resulting in smaller group sizes.

No. of Idols by Birth Year

We group the dataset by year and count how many people are in each one.

idol_birthdays %>%
  mutate(year = format(birthday, format='%Y')) %>%
  group_by(year) %>%
  summarize(number = n()) %>%
  ggplot(aes(x=year, y=number, fill=year)) +
  geom_bar(stat='identity') +
  theme(axis.text.x = element_text(angle=90, hjust=1), legend.position = 'none') +
  labs(title='No. of Idols by Birth Year', x='Year', y=NULL)

No. of Idols by Birth Month

Same as above, but grouping by month instead.

The month field is converted to a factor data type, which treats the entries as categorical variables. This is helpful as it preserves the order of months (otherwise R will sort the x-axis alphabetically).

idol_birthdays %>%
  mutate(month = factor(format(birthday, format='%b'), levels = month.abb)) %>%
  group_by(month) %>%
  summarize(number = n()) %>%
  ggplot(aes(x=month, y=number, fill=month)) +
  geom_bar(stat='identity') +
  theme(legend.position = 'none') +
  labs(title='No. of Idols by Birth Month', x='Month', y=NULL)
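The effect of the factor conversion is easy to check on its own. In this sketch (four Twice birthdays, base R only) the month labels are built from month.abb rather than format(..., '%b'), purely to keep the example independent of the locale.

```r
d <- as.Date(c("1995-09-22", "1997-02-01", "1996-11-09", "1996-11-01"))

# Month abbreviation for each date
m_chr <- month.abb[as.integer(format(d, "%m"))]

# As plain text, months would be ordered alphabetically;
# as a factor with levels = month.abb, calendar order is kept
m_fac <- factor(m_chr, levels = month.abb)

print(sort(unique(m_chr)))
print(levels(droplevels(m_fac)))
print(table(m_fac))
```

table also reports a count of zero for months with no birthdays, since all twelve levels are kept.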

Back to the Problem

What is the shortest time I can cover all member birthdays for a particular idol group?

Our first question is to find out the range, or the shortest period of time to cover all birthdays in a group.

Before we get to the calculations, we need to consider two things:

We want all birthdays to be included. That is, if we have two guys whose birthdays are on 1st and 2nd January, the range is 2 (hence including both their birthdays).
Since the year does not matter for our calculations (we just want to know how far apart the days are in any year), we can standardize all birth years in the dataset to some non-leap year (1997 in this case). Surprisingly, no one in the whole dataset was born on 29th February - so fortunately we don’t need to worry about leap years.

# Since year is irrelevant, standardize all years to a specified non-leap year (1997)
idol_birthdays2 <- idol_birthdays
lubridate::year(idol_birthdays2$birthday) <- 1997

Then, to find the range, we first have to visualize the birthdays on a date line. Let’s imagine a fictional three-member group (A, B and C) with birthdays on January 1st, May 7th, and December 12th:

To cover all birthdays, we can see there are 3 possibilities: A to C, B to (next year’s) A, and C to (next year’s) B. Notice that each selection covers 2 “gaps” - for example if we do A to C, then we consider the “gaps” of 126 days and 219 days.

We want to cover all birthdays in the shortest amount of time. As such, what we have to do is to make the selection with the smallest “gaps”! This can be done by starting the count for our range after the largest “gap”. For the example above, this means counting from C (avoiding the largest “gap” of 219 days) to (next year’s) B, resulting in a range of 147 days (inclusive of everyone’s birthdays).
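The arithmetic for this fictional group can be checked with a few lines of base R. This is just a sketch of the gap logic using day-of-year numbers in a non-leap year, not the full function.

```r
# Fictional group: A = Jan 1, B = May 7, C = Dec 12 (non-leap year)
b <- as.Date(c("1997-01-01", "1997-05-07", "1997-12-12"))
doy <- as.integer(format(b, "%j"))   # day of the year for each birthday

# Gaps between consecutive birthdays, plus the wrap-around gap C -> next A
gaps <- c(diff(doy), 365 - doy[length(doy)] + doy[1])

# Skip the largest gap; +1 so that the last birthday itself is included
shortest_range <- 365 - max(gaps) + 1

# Start counting from the birthday right after the largest gap
start <- b[which.max(gaps) %% length(b) + 1]

print(gaps)
print(shortest_range)
print(format(start, "%m/%d"))
```

The three gaps add up to 365, which is a handy sanity check that the wrap-around gap was computed correctly.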

Applying this logic, we can then write the appropriate function for all groups in our dataset. We can also include in the function a way to tell us whose birthday to start counting from.

birthday_range <- function(birthdays, show_start=FALSE) {
  gaps <- vector()
  
  for (i in 1:length(birthdays)) {
    if (i == length(birthdays)) {
      gaps <- c(gaps, 365 + birthdays[1] - birthdays[i])
    } else {
      gaps <- c(gaps, birthdays[i+1] - birthdays[i])
    }
  }
  
  largest_gap_index <- which.max(gaps)
  
  if (largest_gap_index == length(birthdays)) {
    start_index = 1
    last_index = length(birthdays)
  } else {
    start_index = largest_gap_index + 1
    last_index = largest_gap_index
  }
  
  # 365 - largest gap in days + 1 to include last person's birthday
  range = 366 - gaps[last_index]
  
  if (show_start) {
    return(format(birthdays[start_index], '%m/%d'))
  }
  
  return(range)
}

Let’s see how the above function works for the group Twice.

# Apply function on one group
twice_birthdays <- filter(idol_birthdays2, group == 'Twice')

paste(twice_birthdays$group[1],
      'has a range of',
      birthday_range(twice_birthdays$birthday),
      'and we start counting from',
      birthday_range(twice_birthdays$birthday, show_start=TRUE))

[1] "Twice has a range of 266 and we start counting from 09/22"

We obtain a range of 266 days, and start counting from September 22 which is Nayeon’s birthday. This means that to cover all of Twice’s birthdays in the shortest amount of time, we start with Nayeon on September 22 (and end with Tzuyu on June 14) for which we will take 266 days.

name	birthday
Jihyo	1997-02-01
Mina	1997-03-24
Chaeyoung	1999-04-23
Dahyun	1998-05-28
Tzuyu	1999-06-14
Nayeon	1995-09-22
Jungyeon	1996-11-01
Momo	1996-11-09
Sana	1996-12-29

Let’s then calculate the range for all groups.

# Apply function
idol_birthday_ranges <-
  idol_birthdays2 %>%
  group_by(group) %>%
  summarize(range = birthday_range(birthday)) %>%
  arrange(range)

Some of the shortest and longest ranges
group	range
Fly To The Sky	8
15&	10
LC9	14
Trax	14
Monogram	26
Twice	266
The Boyz	296
SEVENTEEN	301
Rania	302
Cosmic Girls	317

Not surprisingly, the smallest numbers are for duets, for instance Fly to the Sky having two members whose birthdays are 7 days apart (hence a range of 8).

We also see that larger groups tend to have a larger range, such as Twice whose members’ birthdays are well distributed across the year (from February to December) and therefore having a range of 266.

How can I get the birthdays of all idols I’m interested in, within a specified date range?

With what we’ve got, we can indeed write a function to get the birthdays of all idols you’re interested in, given a specified date range!

Since we have a cleaned, full dataset, it’s just a matter of filtering. In particular, such a function will apply the filter function on three variables - the start_date and end_date of visit as well as the groups you are interested in.

Again, the specific year of visit does not matter, so that shall not be required in the input. But things become a bit tricky when the end_date turns out to be before the start_date (that is, arriving one year and leaving in the next). We can deal with this by simply adding 365 days to the end_date when this happens.

# start_date: string in the format 'dd/mm'
# end_date: string in the format 'dd/mm'
# groups: character vector
which_idols <- function(start_date, end_date, groups, df) {
  range <-
    c(start_date, end_date) %>%
    paste0('/1997') %>%
    as.Date(format='%d/%m/%Y')
  
  if (range[2] < range[1]) {
    range <- c(range, range[2] + 365)
    
    celebrations <- 
      df %>%
      filter(group %in% groups &
               ((birthday >= range[1] & birthday <= range[3]) | birthday <= range[2]))
  } else {
    celebrations <- 
      df %>%
      filter(group %in% groups & birthday >= range[1] & birthday <= range[2])
  }
  
  celebrations <-
    celebrations %>%
    arrange(birthday) %>%
    select(name, group) %>%
    left_join(idol_birthdays, by=c('name', 'group')) 
  
  return(celebrations)
}

This assumes that your length of stay is at most 365 days (or a full year), but then if you’re present for the entire year then everyone’s birthday is automatically covered. We covered all bases!
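The wrap-around case is the only subtle part, so here is a stripped-down sketch of just that logic, using the Twice birthdays from earlier with every year set to 1997. The helper in_window is a hypothetical name for this illustration and is not part of the function above.

```r
twice <- data.frame(
  name = c("Jihyo", "Mina", "Chaeyoung", "Dahyun", "Tzuyu",
           "Nayeon", "Jungyeon", "Momo", "Sana"),
  birthday = as.Date(c("1997-02-01", "1997-03-24", "1997-04-23",
                       "1997-05-28", "1997-06-14", "1997-09-22",
                       "1997-11-01", "1997-11-09", "1997-12-29"))
)

# Hypothetical helper: is each birthday inside the visit window?
in_window <- function(birthday, start, end) {
  if (start <= end) {
    birthday >= start & birthday <= end   # window within one year
  } else {
    birthday >= start | birthday <= end   # window wraps past 31 December
  }
}

# Visit from 1 September to 3 January (of the following year)
start <- as.Date("1997-09-01")
end <- as.Date("1997-01-03")

covered <- twice$name[in_window(twice$birthday, start, end)]
print(covered)
```

When the window wraps around the new year, the "and" simply becomes an "or": a birthday counts if it falls after the start date or before the end date.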

As an example, I was in Korea from 1st September 2019 to 3rd January 2020. As a fan of the groups Twice and GFriend, let’s see whose birthdays I could have celebrated!

name	group	birthday
Nayeon	TWICE	1995-09-22
Yuju	GFRIEND	1997-10-04
Jungyeon	TWICE	1996-11-01
Momo	TWICE	1996-11-09
Sowon	GFRIEND	1995-12-07
Sana	TWICE	1996-12-29

Sadly, I was a bit too late into the visiting-and-collecting “hobby”, which meant that I missed most of their events. I managed to attend Sana’s one though, and here’s what I got. Thanks ONCE!

Closing

In this process, you may have noticed that I was seemingly able to anticipate the problems in the data and address them sequentially. This was not actually the case - I spotted problems down the line and then retroactively applied changes. Regardless, understanding the data - specifically the context behind the data - is important! This project would not have been possible had I not had certain contextual knowledge about Korean pop, some idol groups, the data source, and so on. It will not be possible to have all the knowledge regarding your data source, but at least having a rough idea of what your data should look like - based on context or past experience in the subject matter - helps a ton.

In doing this project, some decisions also had to be made on how to deal with certain aspects of the data. (Do I filter this way? Or that way?) There may not be a right or wrong method - so the best way is to go with whatever brings us closer to our project objectives.

This project started out as a simple web scraping and data wrangling task (or so I thought) - but quickly turned into an exercise in regex, dealing with multiple data types, and other problems surrounding data cleaning. Well, this seems like a sneak peek of what’s to come in my future data science work!

Of Idols and Birthdays