South Korea has its own genre of music known as k-pop. Artists such as BTS and BLACKPINK have become more prominent in western media over the past several years, but there is a much more diverse music scene than just these artists. Korean national broadcasting companies have weekly music shows where artists come to promote new albums and singles by performing one or two songs to an international audience and there is a song winner announced that represents the most “popular” song for that week. There are six music shows that are broadcasted on a different day each week and each has their own way of calculating popularity scores:
Almost every show has the concept of a “crown” (* = triple crown, ** = quintuple crown) which denotes that once a song has won three or five times, it is ineligible from winning again. The quintuple crown wasn’t implemented until 2021 for Show! Music Core though so it doesn’t impact our analysis.
Looking at music show wins from the 2016 - 2020, are there any trends in which artists dominated the charts during this time period or what types of artists (group, solo, male, female) typically won?
This project utilizes two sources of data:
The wins information mainly from Wikipedia includes the following information:
Artist is a solo
singer/performer or a groupArtist if they’re a solo
performerThe Kaggle tables hold the following relevant information
This is a total of 4 tables to start with as the music show wins is one and the Kaggle source has three additional tables. I’ll load them down below – note that the Google sheet import requires OAuth in order to access the file.
library(googlesheets4)
# google sheets; this is the ID of the sheet, it doesn't require the full link in order to work
wins <- read_sheet ('1rIjuuurAAIdeo00GTl8wKpIyntrVARHjWG_wy4tMu9I')
# downloaded kaggle csvs
idols <- read.csv('https://raw.githubusercontent.com/addsding/data607/main/final%20project/kpop_idols.csv')
boy_groups <- read.csv('https://raw.githubusercontent.com/addsding/data607/main/final%20project/kpop_idols_boy_groups.csv')
girl_groups <- read.csv('https://raw.githubusercontent.com/addsding/data607/main/final%20project/kpop_idols_girl_groups.csv')
This section will focus on changing column names to be more readable.
library(dplyr)
wins <- wins |>
rename("show_date" = "DATE"
, "show_name" = "SHOW"
, "song_name" = "SONG"
, "artist_name" = "ARTIST"
, "sub_group" = "SUB GROUP"
, "feature_artist_name" = "FEATURING"
, "solo_group" = "SOLO OR GROUP"
, "artist_full_name" = "FULL NAME"
, "group_name_for_soloists" = "GROUP NAME"
, "points" = "POINTS"
, "is_crown_winner" = "CROWN"
)
boy_groups <- boy_groups |>
rename("group_name" = "Name"
, "short_name" = "Short"
, "korean_group_name" = "Korean.Name"
, "debut_date" = "Debut"
, "company_name" = "Company"
, "member_count" = "Members"
, "orig_member_count" = "Orig..Memb."
, "fanclub_name" = "Fanclub.Name"
, "is_active" = "Active"
)
girl_groups <- girl_groups |>
rename("group_name" = "Name"
, "short_name" = "Short"
, "korean_group_name" = "Korean.Name"
, "debut_date" = "Debut"
, "company_name" = "Company"
, "member_count" = "Members"
, "orig_member_count" = "Orig..Memb."
, "fanclub_name" = "Fanclub.Name"
, "is_active" = "Active"
)
idols <- idols |>
rename("stage_name" = "Stage.Name"
, "full_name" = "Full.Name"
, "korean_name" = "Korean.Name"
, "korean_stage_name" = "K..Stage.Name"
, "birth_date" = "Date.of.Birth"
, "group_name" = "Group"
, "nationality_country" = "Country"
, "birth_place" = "Birthplace"
, "other_groups" = "Other.Group"
, "gender" = "Gender"
)
The dataframe I’d like to end up with uses wins as the
base and essentially just joins artist/group info so we can see who
typically wins music shows. We’ll start by combining
boy_groups and girl_groups, selecting the
relevant information, and also creating columns to join on where the
strings are all lower case. We’ll also modify the
debut_date column to be an actual date instead of a
chr type.
library(stringr)
# adding a gender column for both dataframes
boy_groups <- boy_groups |> mutate(gender = "M")
girl_groups <- girl_groups |> mutate(gender = "F")
# union the two dataframes together and select only relevant columns
all_groups <- rbind(boy_groups, girl_groups) |>
select(group_name, debut_date, company_name, member_count, orig_member_count, is_active, gender)
# make a new column that is the lowercase version of group_name (joining field)
all_groups <- all_groups |> mutate(group_name_lower = str_to_lower(group_name))
# transform the debut_date column into a date type
all_groups$debut_date <- as.Date(all_groups$debut_date, "%Y-%m-%d")
head(all_groups)
## group_name debut_date company_name member_count orig_member_count is_active
## 1 100% 2012-09-18 TOP Media 4 7 Yes
## 2 14U 2017-04-17 BG 14 14 Yes
## 3 1the9 2019-02-09 MBK 9 9 Yes
## 4 24K 2012-09-06 Choeun 8 6 Yes
## 5 2AM 2008-06-21 JYP, Big Hit 4 4 No
## 6 2PM 2008-07-04 JYP 6 7 Yes
## gender group_name_lower
## 1 M 100%
## 2 M 14u
## 3 M 1the9
## 4 M 24k
## 5 M 2am
## 6 M 2pm
Next, we’ll be taking subsets of the Wikipedia data to prep for
joining. As we have one table (all_groups) holding group
information and then one table (idols) for soloists, we’ll
have to eventually do two joins. We’re also making all the join columns
lower case for simplicity.
Note that wins had rows for days where there weren’t
shows being broadcasted or no winning song; by applying these filters in
this section, we’re removing these rows. This means that the ending
dataframe will likely have less rows than the one we began with
(wins).
# make the artist name and group name info lower case
wins <- wins |> mutate(artist_name_lower = str_to_lower(artist_name)
, artist_full_name_lower = str_to_lower(artist_full_name)
, group_name_for_soloists_lower = str_to_lower(group_name_for_soloists)
)
# change sub_group into a character column and replace all cases of NULL with ''
wins$sub_group <- as.character(wins$sub_group)
wins$sub_group[wins$sub_group == 'NULL'] <- NA
# make sure that show_date is formatted as a date
wins$show_date <- as.Date(wins$show_date, "%Y-%m-%d")
# create separate dataframes for group and solo
group_wins <- wins |> filter(solo_group == "G")
solo_wins <- wins |> filter(solo_group == "S")
head(group_wins)
## # A tibble: 6 × 14
## show_date show_name song_name artist_name sub_group feature_artist_name
## <date> <chr> <chr> <chr> <chr> <chr>
## 1 2016-01-03 Inkigayo Again Turbo <NA> <NA>
## 2 2016-01-08 Music Bank Run BTS <NA> <NA>
## 3 2016-01-14 MCountdown Dumb & Dumber iKON <NA> <NA>
## 4 2016-01-21 MCountdown Dumb & Dumber iKON <NA> <NA>
## 5 2016-01-26 The Show Warning Sign Teen Top <NA> <NA>
## 6 2016-01-28 MCountdown Dumb & Dumber iKON <NA> <NA>
## # ℹ 8 more variables: solo_group <chr>, artist_full_name <chr>,
## # group_name_for_soloists <chr>, points <dbl>, is_crown_winner <lgl>,
## # artist_name_lower <chr>, artist_full_name_lower <chr>,
## # group_name_for_soloists_lower <chr>
With these two subsets created now, we’ll start with
group_wins – this is joined to all_groups
based on artist_name. This will be a left join with
group_wins as the base and all_groups
appending all its data.
group_wins_full <- group_wins |> left_join(all_groups, by = join_by(artist_name_lower == group_name_lower))
# find how many NAs resulted from this
sum(is.na(group_wins_full$group_name))
## [1] 40
40 nulls – which groups don’t exist in our Kaggle dataset?
group_wins_full |>
group_by(artist_name) |>
filter(is.na(group_name)) |>
summarise(count = n(),
.groups = 'drop') |>
arrange(desc(count))
## # A tibble: 10 × 2
## artist_name count
## <chr> <int>
## 1 X1 11
## 2 IOI 9
## 3 AKMU 6
## 4 SECHSKIES 4
## 5 Epik High 3
## 6 SSAK3 3
## 7 10cm 1
## 8 Dynamic Duo 1
## 9 Hyukoh 1
## 10 Noel 1
As there are only 10 groups missing, just adding this data to the dataframe can fix this. An alternative fix is to just go into the .csv and modify it myself, but I’d like to attempt to do it in R.
# create the lists containing all the data
group_x1 <- list(group_name = "X1"
, debut_date = as.Date("2019-08-27", "%Y-%m-%d")
, company_name = "Stone"
, member_count = 11
, orig_member_count = 11
, is_active = "No"
, gender = "M"
, group_name_lower = "x1"
)
group_ioi <- list(group_name = "IOI"
, debut_date = as.Date("2016-05-04", "%Y-%m-%d")
, company_name = "YMC"
, member_count = 11
, orig_member_count = 11
, is_active = "No"
, gender = "M"
, group_name_lower = "ioi"
)
group_akmu <- list(group_name = "AKMU"
, debut_date = as.Date("2014-04-07", "%Y-%m-%d")
, company_name = "YG"
, member_count = 2
, orig_member_count = 2
, is_active = "Yes"
, gender = "MF"
, group_name_lower = "akmu"
)
group_sechskies <- list(group_name = "SECHSKIES"
, debut_date = as.Date("1997-04-97", "%Y-%m-%d")
, company_name = "YG"
, member_count = 4
, orig_member_count = 6
, is_active = "Yes"
, gender = "M"
, group_name_lower = "sechskies"
)
group_epikhigh <- list(group_name = "Epik High"
, debut_date = as.Date("2003-10-23", "%Y-%m-%d")
, company_name = "YG"
, member_count = 3
, orig_member_count = 3
, is_active = "Yes"
, gender = "M"
, group_name_lower = "epik high"
)
group_ssak3 <- list(group_name = "SSAK3"
, debut_date = as.Date("2020-07-25", "%Y-%m-%d")
, company_name = "MBC"
, member_count = 3
, orig_member_count = 3
, is_active = "No"
, gender = "MF"
, group_name_lower = "ssak3"
)
group_10cm <- list(group_name = "10cm"
, debut_date = as.Date("2010-04-10", "%Y-%m-%d")
, company_name = "Ten Music"
, member_count = 1
, orig_member_count = 2
, is_active = "Yes"
, gender = "M"
, group_name_lower = "10cm"
)
group_dynamicduo <- list(group_name = "Dynamic Duo"
, debut_date = as.Date("2004-05-17", "%Y-%m-%d")
, company_name = "Amoeba Culture"
, member_count = 2
, orig_member_count = 2
, is_active = "Yes"
, gender = "M"
, group_name_lower = "dynamic duo"
)
group_hyukoh <- list(group_name = "Hyukoh"
, debut_date = as.Date("2014-09-18", "%Y-%m-%d")
, company_name = "YG"
, member_count = 4
, orig_member_count = 4
, is_active = "Yes"
, gender = "M"
, group_name_lower = "hyukoh"
)
group_noel <- list(group_name = "Noel"
, debut_date = as.Date("2002-12-16", "%Y-%m-%d")
, company_name = "JYP"
, member_count = 4
, orig_member_count = 4
, is_active = "Yes"
, gender = "M"
, group_name_lower = "noel"
)
# append all of these groups to all_groups
all_groups <-
rbind(
rbind(
rbind(
rbind(
rbind(
rbind(
rbind(
rbind(
rbind(
rbind(all_groups, group_10cm)
, group_akmu)
, group_dynamicduo)
, group_epikhigh)
, group_hyukoh)
, group_ioi)
, group_noel)
, group_sechskies)
, group_ssak3)
, group_x1)
# redo the join and verify there are no more NAs
group_wins_full <- group_wins |> left_join(all_groups, by = join_by(artist_name_lower == group_name_lower))
sum(is.na(group_wins_full$group_name))
## [1] 0
No more NAs!
Now, we’ll do a join for the soloist wins – this will entail us
joining by artist_full_name as well as
group_name_for_soloists to ensure we get the right person
as there could be cases with artists having the same name.
# make group_name 'Solo' for those without a group name
idols$group_name[idols$group_name == ""] <- "Solo"
# make the fields in idols for joining (lower case versions of two other fields) and select relevant fields
idols <- idols |> mutate(full_name_lower = str_to_lower(full_name)
, group_name_lower = str_to_lower(group_name)
) |>
select(stage_name, full_name, gender, full_name_lower, group_name_lower)
solo_wins_full <- left_join(solo_wins, idols, by = c('artist_full_name_lower' = 'full_name_lower'
, 'group_name_for_soloists_lower' = 'group_name_lower')
)
sum(is.na(solo_wins_full$stage_name))
## [1] 52
52 nulls – who’s missing?
solo_wins_full |>
group_by(artist_name) |>
filter(is.na(stage_name)) |>
summarise(count = n(),
.groups = 'drop') |>
summarise(count = n())
## # A tibble: 1 × 1
## count
## <int>
## 1 23
As there are 23 missing artists here, I’ll opt to add these directly
to the idols .csv and reload that file. I’ll also only be adding
relevant data to avoid researching too many irrelevant things (birth
place, birthday, etc.) before rejoining it to
solo_wins.
idols <- read.csv('https://raw.githubusercontent.com/addsding/data607/main/final%20project/kpop_idols_v2.csv')
idols <- idols |>
rename("stage_name" = "Stage.Name"
, "full_name" = "Full.Name"
, "korean_name" = "Korean.Name"
, "korean_stage_name" = "K..Stage.Name"
, "birth_date" = "Date.of.Birth"
, "group_name" = "Group"
, "nationality_country" = "Country"
, "birth_place" = "Birthplace"
, "other_groups" = "Other.Group"
, "gender" = "Gender"
)
idols$group_name[idols$group_name == ""] <- "Solo"
idols <- idols |> mutate(full_name_lower = str_to_lower(full_name)
, group_name_lower = str_to_lower(group_name)
) |>
select(stage_name, full_name, gender, full_name_lower, group_name_lower)
solo_wins_full <- left_join(solo_wins, idols, by = c('artist_full_name_lower' = 'full_name_lower'
, 'group_name_for_soloists_lower' = 'group_name_lower')
)
sum(is.na(solo_wins_full$stage_name))
## [1] 0
No more nulls!
Now let’s join debut_date to the appropriate soloists
(only if they’re part of a group).
solo_wins_full <- left_join(solo_wins_full
, all_groups |> select(group_name_lower, debut_date)
, by = c('group_name_for_soloists_lower' = 'group_name_lower')) |>
select(show_date
, show_name
, song_name
, artist_name
, sub_group
, feature_artist_name
, solo_group
, points
, is_crown_winner
, debut_date
, gender
)
head(solo_wins_full)
## # A tibble: 6 × 11
## show_date show_name song_name artist_name sub_group feature_artist_name
## <date> <chr> <chr> <chr> <chr> <chr>
## 1 2016-01-07 MCountdown Daddy Psy <NA> <NA>
## 2 2016-01-10 Inkigayo Lonely Night Gary <NA> <NA>
## 3 2016-01-15 Music Bank Dream Suzy <NA> Baekhyun
## 4 2016-01-17 Inkigayo Dream Suzy <NA> Baekhyun
## 5 2016-01-22 Music Bank Dream Suzy <NA> Baekhyun
## 6 2016-01-24 Inkigayo Dream Suzy <NA> Baekhyun
## # ℹ 5 more variables: solo_group <chr>, points <dbl>, is_crown_winner <lgl>,
## # debut_date <date>, gender <chr>
Now both tables have the relevant fields we’re looking for!
Let’s union these two tables, solo_wins_full and
group_wins_full, into one.
group_wins_full <- group_wins_full |> select(show_date
, show_name
, song_name
, artist_name
, sub_group
, feature_artist_name
, solo_group
, points
, is_crown_winner
, debut_date
, gender
)
all_wins <- rbind(solo_wins_full, group_wins_full) |> arrange(show_date)
head(all_wins)
## # A tibble: 6 × 11
## show_date show_name song_name artist_name sub_group feature_artist_name
## <date> <chr> <chr> <chr> <chr> <chr>
## 1 2016-01-03 Inkigayo Again Turbo <NA> <NA>
## 2 2016-01-07 MCountdown Daddy Psy <NA> <NA>
## 3 2016-01-08 Music Bank Run BTS <NA> <NA>
## 4 2016-01-10 Inkigayo Lonely Night Gary <NA> <NA>
## 5 2016-01-14 MCountdown Dumb & Dumber iKON <NA> <NA>
## 6 2016-01-15 Music Bank Dream Suzy <NA> Baekhyun
## # ℹ 5 more variables: solo_group <chr>, points <dbl>, is_crown_winner <lgl>,
## # debut_date <date>, gender <chr>
We’re reading for exploration!
Let’s start with some summary stats.
summary(all_wins)
## show_date show_name song_name artist_name
## Min. :2016-01-03 Length:1268 Length:1268 Length:1268
## 1st Qu.:2017-05-13 Class :character Class :character Class :character
## Median :2018-08-04 Mode :character Mode :character Mode :character
## Mean :2018-07-23
## 3rd Qu.:2019-10-03
## Max. :2020-12-27
##
## sub_group feature_artist_name solo_group points
## Length:1268 Length:1268 Length:1268 Min. : 3382
## Class :character Class :character Class :character 1st Qu.: 6880
## Mode :character Mode :character Mode :character Median : 8094
## Mean : 8058
## 3rd Qu.: 9237
## Max. :15019
## NA's :238
## is_crown_winner debut_date gender
## Mode :logical Min. :1995-09-06 Length:1268
## FALSE:1086 1st Qu.:2013-06-13 Class :character
## TRUE :182 Median :2014-11-17 Mode :character
## Mean :2014-07-02
## 3rd Qu.:2016-04-09
## Max. :2020-07-25
## NA's :91
Some interesting things right off the bat – winners’ debut dates range from 1995 to 2020, awesome to see that someone who debuted in 1995 can still win 20+ years later.
Let’s start breaking down winners by various slices of data.
How many artists are there that have won? How does that break down between group and soloist artists?
artists <- all_wins |>
group_by(solo_group) |>
summarise(count = n_distinct(artist_name)) |>
arrange(desc(count))
head(artists)
## # A tibble: 2 × 2
## solo_group count
## <chr> <int>
## 1 G 87
## 2 S 51
In total, there are 88 groups and 51 soloists in this dataset, resulting in 139 total artists.
Have groups always remained dominant every year or have soloists been trending upwards in wins?
library(lubridate)
library(ggplot2)
by_gs_year <- all_wins |>
group_by(solo_group, year(show_date)) |>
summarise(count = n(),
.groups = 'drop')
by_gs_year <- by_gs_year |>
rename("year" = "year(show_date)"
) |>
group_by(year) |> mutate(percent = count/sum(count))
ggplot(by_gs_year, aes(fill = solo_group
, y = percent
, x = year)) +
geom_bar(position="dodge", stat="identity")
This trend seems pretty consistent – 2019 had a bit of a bump up for soloists, but otherwise groups came back in 2020.
How many different groups won in each year? Are we seeing diverse winners or are they usually the same artists?
by_year <- all_wins |>
group_by(year(show_date)) |>
summarise(count = n(),
.groups = 'drop')
by_year <- by_year |>
rename("year" = "year(show_date)"
)
ggplot(by_year, aes(y = count
, x = year)) +
geom_bar(position="dodge", stat="identity")
There’s a slight increase in number of artists who win over time, signifying a diversified pool of musical talent.
Which artists won the most during this time period?
by_artist <- all_wins |>
group_by(artist_name) |>
summarise(count = n(),
percent = 100 * n()/nrow(all_wins),
.groups = 'drop') |>
arrange(desc(count))
head(by_artist, 10)
## # A tibble: 10 × 3
## artist_name count percent
## <chr> <int> <dbl>
## 1 BTS 114 8.99
## 2 Twice 113 8.91
## 3 GFRIEND 71 5.60
## 4 Red Velvet 63 4.97
## 5 EXO 51 4.02
## 6 Wanna One 48 3.79
## 7 BLACKPINK 43 3.39
## 8 SEVENTEEN 39 3.08
## 9 ITZY 34 2.68
## 10 MAMAMOO 33 2.60
BTS (Bangtan) won the most between 2016-2020 which is no surprise as this is when they were rising to fame. Twice is close behind and GFRIEND comes in third here – this seems very group focused though, who were the top soloists?
by_artist_solo <- all_wins |>
filter(solo_group == "S") |>
group_by(artist_name) |>
summarise(count = n(),
percent = 100 * n()/nrow(all_wins),
.groups = 'drop') |>
arrange(desc(count))
head(by_artist_solo, 10)
## # A tibble: 10 × 3
## artist_name count percent
## <chr> <int> <dbl>
## 1 IU 32 2.52
## 2 Sunmi 15 1.18
## 3 Chungha 13 1.03
## 4 Zico 12 0.946
## 5 Taeyeon 11 0.868
## 6 Kang Daniel 10 0.789
## 7 Hwasa 7 0.552
## 8 Suzy 7 0.552
## 9 Taemin 7 0.552
## 10 Baekhyun 6 0.473
IU is a very prominent soloist in the industry so her topping the charts makes sense!
What if we tried looking at wins by artist over time? To make this chart less complicated, let’s only include the top 10 groups.
by_artist_year <- all_wins |>
filter(artist_name %in% head(by_artist$artist_name, 10)) |>
group_by(artist_name, year(show_date)) |>
summarise(count = n(),
.groups = 'drop')
by_artist_year <- by_artist_year |>
rename("year" = "year(show_date)"
) |>
group_by(year) |> mutate(percent = count/sum(count))
ggplot(by_artist_year, aes(fill = artist_name
, y = count
, x = year)) +
geom_bar(position="dodge", stat="identity")
With this chart, we can kind of see the rise of some groups and the fall of others; EXO and GFRIEND had many more wins in 2016, but that number slowly dwindles as years pass. Twice follows this trend and spikes in 2017, but slowly starts decreasing in wins into 2020. On the opposite side, BTS spikes in popularity in 2020 after a gradual increase past 20 wins before skyrocketing almost to 50. It’s also interesting how we Wanna One appears in 2017 and then disappears after 2019 (this is due to Wanna One only being a temporary group). Other notable introductions are ITZY who debuted in 2019 and then BLACKPINK who have very little activity in the first three years, but then did very well in 2020.
How many songs were given crowns over the years?
by_crown_year <- all_wins |>
filter(is_crown_winner) |>
group_by(year(show_date)) |>
summarise(count = n_distinct(song_name),
.groups = 'drop')
by_crown_year <- by_crown_year |>
rename("year" = "year(show_date)"
)
ggplot(by_crown_year, aes(y = count
, x = year)) +
geom_bar(position="dodge", stat="identity")
Interestingly, we see an increase, then sharp decrease, then a spike in 2020. If we remember the artist count chart, there were more artists who won in 2019; it then makes sense that there were less crowns in 2019 as a more diverse amount of artists were winning.
Who were the winners of these crowns? Let’s start with groups.
by_crown_year_group <- all_wins |>
filter(is_crown_winner, solo_group == "G") |>
group_by(artist_name, year(show_date)) |>
summarise(count = n_distinct(song_name),
.groups = 'drop')
by_crown_year_group <- by_crown_year_group |>
rename("year" = "year(show_date)"
) |>
group_by(year) |> mutate(percent = count/sum(count))
ggplot(by_crown_year_group, aes(fill = artist_name
, y = count
, x = year)) +
geom_bar(position="dodge", stat="identity")
In terms of groups that had more than one crown a year, these were Gfriend, Twice, and BTS; Twice had a whopping 4 crowns in 2017 and 3 in 2018!
What about soloists?
by_crown_year_soloist <- all_wins |>
filter(is_crown_winner, solo_group == "S") |>
group_by(artist_name, year(show_date)) |>
summarise(count = n_distinct(song_name),
.groups = 'drop')
by_crown_year_soloist <- by_crown_year_soloist |>
rename("year" = "year(show_date)"
) |>
group_by(year) |> mutate(percent = count/sum(count))
ggplot(by_crown_year_soloist, aes(fill = artist_name
, y = count
, x = year)) +
geom_bar(position="dodge", stat="identity")
No one had more than one crown per year, but IU won three out of the four years!
Overall, do females or males win more?
by_gender <- all_wins |>
group_by(gender) |>
summarise(count = n(),
percent = 100 * n()/nrow(all_wins),
.groups = 'drop') |>
arrange(desc(count))
by_gender
## # A tibble: 3 × 3
## gender count percent
## <chr> <int> <dbl>
## 1 M 631 49.8
## 2 F 628 49.5
## 3 MF 9 0.710
This is very much a fifty-fifty split (there are a few wins from groups with more than one gender) which is interesting – what if we broke this down by group vs. solo?
by_gender_gs <- all_wins |>
group_by(gender, solo_group) |>
summarise(count = n(),
percent = 100 * n()/nrow(all_wins),
.groups = 'drop') |>
arrange(desc(count))
by_gender_gs
## # A tibble: 5 × 4
## gender solo_group count percent
## <chr> <chr> <int> <dbl>
## 1 M G 541 42.7
## 2 F G 512 40.4
## 3 F S 116 9.15
## 4 M S 90 7.10
## 5 MF G 9 0.710
We see a slight split here – males seem to dominate the wins in terms of groups, but soloist females have more wins than soloist males.
How about wins over time by gender?
by_gender_year <- all_wins |>
group_by(gender, year(show_date)) |>
summarise(count = n(),
.groups = 'drop')
by_gender_year <- by_gender_year |>
rename("year" = "year(show_date)"
) |>
group_by(year) |> mutate(percent = count/sum(count))
ggplot(by_gender_year, aes(fill = gender
, y = percent
, x = year)) +
geom_bar(position="dodge", stat="identity")
Interestingly, we can see that 2016-2018 had more male wins, but female wins became more prominent in 2019-2020. There also seemed to be more of a rise of multi-gendered groups in 2019-2020 as we saw more wins from them as well.
Do certain music shows hold more biases towards specific genders?
by_gender_show <- all_wins |>
group_by(gender, show_name) |>
summarise(count = n(),
.groups = 'drop') |>
group_by(show_name) |> mutate(percent = count/sum(count))
ggplot(by_gender_show, aes(fill = gender
, y = percent
, x = show_name)) +
geom_bar(position="dodge", stat="identity")
Inkigayo and MCountdown seem to favor female artists more heavily (especially the former) while male groups dominate the other shows; Music Bank is very much male dominated in comparison to the other shows as well.
Are winning artists typically more recent artists or older ones?
Note a limitation for this cut of data is that if a soloist doesn’t originate from a group, our data doesn’t have when they debuted. Due to this, let’s focus on just group wins.
by_debut <- filter(all_wins, solo_group == "G") |>
group_by(year(debut_date)) |>
summarise(count = n(),
.groups = 'drop') |>
arrange(desc(count))
by_debut <- by_debut |>
rename("year" = "year(debut_date)"
)
head(by_debut, 10)
## # A tibble: 10 × 2
## year count
## <dbl> <int>
## 1 2015 268
## 2 2014 179
## 3 2013 114
## 4 2016 112
## 5 2012 110
## 6 2019 61
## 7 2017 57
## 8 2018 52
## 9 2011 31
## 10 2010 23
A lot of these wins seem to come from those who debuted 2-3 years before our daterange – does that change over time?
by_debut_year <- filter(all_wins, solo_group == "G") |>
group_by(as.character(year(debut_date)), year(show_date)) |>
summarise(count = n(),
.groups = 'drop')
by_debut_year <- by_debut_year |>
rename("show_year" = "year(show_date)"
, "debut_year" = "as.character(year(debut_date))"
) |>
group_by(show_year) |> mutate(percent = count/sum(count))
ggplot(by_debut_year, aes(fill = debut_year
, y = percent
, x = show_year)) +
geom_bar(position="stack", stat="identity")
While hard to see specifics, it’s clear that in 2019, those groups did a much better job at splashing into the scene and winning shows.
What are point distributions by music show? Note that Show Champion doesn’t have a scoring system.
ggplot(filter(all_wins, show_name == "Inkigayo"), aes(x = points)) +
geom_histogram() +
theme_classic() +
ggtitle("Inkigayo Points Distribution")
Interestingly, this chart seems to have 2 curves.
ggplot(filter(all_wins, show_name == "MCountdown", ), aes(x = points)) +
geom_histogram() +
theme_classic() +
ggtitle("MCountdown Points Distribution")
The max point value of 11000 seems to interestingly have the most amount of songs. Perhaps this means that it’s common to get the maximum score.
ggplot(filter(all_wins, show_name == "Music Bank"), aes(x = points)) +
geom_histogram() +
theme_classic() +
ggtitle("Music Bank Points Distribution")
This is the most normally distributed one so far, albeit it’s a bit right-skewed.
ggplot(filter(all_wins, show_name == "Show! Music Core"), aes(x = points)) +
geom_histogram() +
theme_classic() +
ggtitle("Show! Music Core Points Distribution")
Similar to Music Bank, this looks more normally distributed and is also right-skewed.
ggplot(filter(all_wins, show_name == "The Show"), aes(x = points)) +
geom_histogram() +
theme_classic() +
ggtitle("The Show Points Distribution")
Interestingly, this is the first left skewed data!
What artists had songs that garnered the most/max amount of points?
max_points <- all_wins |>
filter(show_name != "Show Champion") |>
group_by(show_name) |>
summarise(max_points = max(points, na.rm = TRUE)
, .groups = 'drop')
max_point_wins <- inner_join(all_wins, max_points, by = c('points' = 'max_points'))
by_artist_max_points <- max_point_wins |>
group_by(artist_name, gender) |>
summarise(count = n(),
.groups = 'drop') |>
arrange(desc(count))
head(by_artist_max_points, 10)
## # A tibble: 8 × 3
## artist_name gender count
## <chr> <chr> <int>
## 1 BTS M 16
## 2 EXO M 4
## 3 NU'EST M 4
## 4 Wanna One M 3
## 5 SHINee M 2
## 6 Twice F 2
## 7 NCT M 1
## 8 Suzy F 1
BTS had 16 music show wins with maximum points – very impressive. Interestingly, we can see that only 3 of these wins are from female artists. Do these stats correlate with music show and gender? Let’s check.
by_artist_max_points_music_shows <- max_point_wins |>
group_by(show_name.x, gender) |>
summarise(count = n(),
.groups = 'drop') |>
arrange(desc(count))
ggplot(by_artist_max_points_music_shows, aes(fill = gender
, y = count
, x = show_name.x)) +
geom_bar(position="dodge", stat="identity")
From the previous chart, we saw that females won more on Inkigayo and MCountdown; this persists here with the maximum score wins as well.
Right from the start, data collection and cleaning wasn’t the
easiest. There was a constant need to go back to refresh the Google
Sheet data due to typos, inconsistencies (the downside of having
collected this data over so many years is that there were a few
variations of how things were spelled/capitilized), and just minor
adjustments due to romanization (there are many ways to type Korean
names in English letters). Getting the two data sources to match up as
well was quite taxing, but after many attempts of matching up the data,
we finally got to parity (albeit some of the idols data is
missing a few fields, but they weren’t necessary for this analysis).
Overall, we can take away a few points from the data exploration piece:
There are several next steps to take with this project, including but not limited to:
This project has a lot to expand upon and it could be interesting to come back to this at some point in the future!