This project analyzes a dataset of K-pop idols. The main objective is to explore whether an idol’s success is correlated with the length of their career, by examining several factors and metrics related to an idol’s career start and level of success. A second objective is to present the findings as clear, visually engaging plots, so that the results are easy to understand and interpret.

This project utilizes two datasets sourced from different platforms. One dataset is obtained from Kaggle, a renowned data science community, while the other dataset is collected from a fan website dedicated to the subject. The datasets can be accessed through the following links:
- kaggle: https://www.kaggle.com/datasets/nicolsalayoarias/all-kpop-idols
- fan web: https://dbkpop.com/db/kpop-idol-instagram-accounts/

Without further ado, let’s start the analysis and visualization!

Data Wrangling

Library

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ readr   2.1.3

## Warning: package 'ggplot2' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Data Preparations

Let’s read the first data.

kpop <- read.csv('kpopidolsv3.csv')
head(kpop)

nrow(kpop)

## [1] 1778

sum(is.na(kpop))

## [1] 2154

The dataset consists of 1778 rows, one for each of the 1778 idols in the data. It contains biographical details of the idols, such as birth date, debut date, height, and weight. To assess an idol’s success, however, we need additional information, which is why we bring in a second dataset. It contains the names of the idols and their Instagram follower counts, which we use to gauge their popularity and social media presence.

Let’s read the second data.

followers <- read.csv('kpopfollowers.csv')

tail(followers)

nrow(followers)

## [1] 720

The second dataset contains 720 rows of data.

Now we want to merge the two datasets. A merge needs a key that links the rows of one table to the rows of the other. Here, the idol’s name is the natural key. However, some idols share the same name, so matching on the name alone could pair rows that belong to different people. To address this, we add the group name as a second key.

Inspecting the group name column reveals inconsistent capitalization: some entries are fully uppercase, while others have only the first letter capitalized. To make the keys consistent, we convert the group name column to uppercase in both datasets before merging.

kpop$Group <-  toupper(kpop$Group)

followers$Group <- toupper(followers$Group)
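
A minimal sketch of the same key clean-up in base R, on a toy table (the rows are illustrative, not taken from the datasets). Stray whitespace is another common reason keys fail to match; whether it occurs in these files is an assumption worth checking. The sketch also counts rows that repeat an earlier name and group pair, since duplicated keys would multiply rows in the merge.

```r
# Toy data; values are illustrative only
toy <- data.frame(
  Stage.Name = c("Ace", "Ace", "Amber"),
  Group      = c("Vav ", "VAV", "f(x)"),
  stringsAsFactors = FALSE
)

# Normalise the group key: trim whitespace, then convert to uppercase
toy$Group <- toupper(trimws(toy$Group))

# Count rows that repeat an earlier (Stage.Name, Group) pair
n_dup <- sum(duplicated(toy[, c("Stage.Name", "Group")]))
n_dup
```

A non-zero count on the real data would mean the name and group pair is not a unique key.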

We also need to rename the first column of the followers data to Stage.Name so that it matches the column name in the kpop data.

colnames(followers)[1]="Stage.Name"

# merging
kpop_idol <- merge(followers, kpop, by = c("Stage.Name", "Group"))

# print merged table
head(kpop_idol)

We can see that the data has merged successfully. Let’s see how many rows we get.

nrow(kpop_idol)

## [1] 406

The merged data has 406 rows. Some rows were dropped because they have no match in the other dataset: the followers data contains idols that are not in the idol data, and vice versa. This happens because merge() performs an inner join by default, which keeps only rows whose keys appear in both tables. In my opinion, about 400 rows is sufficient for this analysis. If we needed more rows, we could use a left join or a full outer join instead, but the extra rows would carry NA values in the columns that come from the other table, which we don’t need.
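
The difference between the join types can be sketched with two small toy tables. The "Zed"/"DEMO" row and its follower count are made up so that one followers row has no partner; the other values are copied from the output shown in this report.

```r
# Toy tables; "Zed"/"DEMO" is a made-up row with no match in the idol table
followers_toy <- data.frame(
  Stage.Name = c("Ace", "Amber", "Zed"),
  Group      = c("VAV", "F(X)", "DEMO"),
  Followers  = c(335439, 5519743, 1000)
)
idols_toy <- data.frame(
  Stage.Name = c("Ace", "Amber", "Ahra"),
  Group      = c("VAV", "F(X)", "FAVORITE"),
  Company    = c("A team", "SM", "Astory")
)
keys <- c("Stage.Name", "Group")

inner <- merge(followers_toy, idols_toy, by = keys)                # matched rows only
left  <- merge(followers_toy, idols_toy, by = keys, all.x = TRUE)  # keep every followers row
outer <- merge(followers_toy, idols_toy, by = keys, all = TRUE)    # keep every row from both

c(inner = nrow(inner), left = nrow(left), outer = nrow(outer))

# Followers rows that found no partner in the idol table
unmatched <- left[is.na(left$Company), keys]
unmatched
```

Listing the unmatched keys this way is a quick check on whether rows were lost to genuine absence or to spelling differences between the two sources.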

The next step is to clean the data and make it suitable for analysis, for example by converting data types and handling NA values.

glimpse(kpop_idol)

## Rows: 406
## Columns: 19
## $ Stage.Name     <chr> "Ace", "Ahra", "Ahyoung", "Alice", "Amber", "Arang", "A…
## $ Group          <chr> "VAV", "FAVORITE", "DAL SHABET", "HELLO VENUS", "F(X)",…
## $ ig_name        <chr> "ace.vav", "ahra.view", "a_young91", "hv_alice", "ajol_…
## $ Followers      <chr> "335.439", "12.342", "104.79", "111.35", "5.519.743", "…
## $ Gender.x       <chr> "Boy", "Girl", "Girl", "Girl", "Girl", "Girl", "Girl", …
## $ Full.Name      <chr> "Jang Wooyoung", "Go Ahra", "Cho Jayoung", "Song Joohee…
## $ Korean.Name    <chr> "장우영", "고아라", "조자영", "송주희", "엠버 조세핀 리…
## $ K.Stage.Name   <chr> "에이스", "아라", "아영", "앨리스", "엠버", "아랑", "아…
## $ Date.of.Birth  <chr> "28/08/1992", "21/02/2001", "26/05/1991", "21/03/1990",…
## $ Debut          <chr> "31/10/2015", "5/07/2017", "3/01/2011", "9/05/2012", "5…
## $ Company        <chr> "A team", "Astory", "Happy Face", "Fantagio", "SM", "My…
## $ Country        <chr> "South Korea", "South Korea", "South Korea", "South Kor…
## $ Second.Country <chr> "", "", "", "", "Taiwan", "", "", "", "USA", "USA", "",…
## $ Height         <int> 177, NA, NA, 166, 167, 167, NA, 163, NA, NA, 183, 178, …
## $ Weight         <int> 63, NA, NA, 47, NA, 49, NA, 47, NA, NA, 61, 60, 41, 58,…
## $ Birthplace     <chr> "", "Yeosu", "Seoul", "Wonju", "Los Angeles", "", "Daeg…
## $ Other.Group    <chr> "", "", "", "", "", "", "", "", "NU'EST W", "", "", "Ba…
## $ Former.Group   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ Gender.y       <chr> "M", "F", "F", "F", "F", "F", "F", "F", "M", "F", "M", …

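Let’s remove the ‘.’ characters from the “Followers” column. They are thousands separators, so we need to remove them before converting the column to numeric. Otherwise, R would read a single ‘.’ as a decimal point, and values with two separators would fail to convert.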

# remove every '.' separator; fixed() matches a literal dot rather than a regex
kpop_idol$Followers <- str_remove_all(kpop_idol$Followers, fixed('.'))
kpop_idol$Followers

##   [1] "335439"   "12342"    "10479"    "11135"    "5519743"  "17749"   
##   [7] "30965"    "1857972"  "718784"   "541344"   "442087"   "313078"  
##  [13] "71558"    "21803656" "17525604" "10703"    "2703891"  "3207006" 
##  [19] "932196"   "1984459"  "53052"    "2016565"  "7476"     "20562"   
##  [25] "39052"    "6162880"  "2213645"  "522877"   "8858825"  "38718"   
##  [31] "98726"    "78478"    "498678"   "316014"   "24056706" "5591"    
##  [37] "57592"    "23906"    "1008659"  "279558"   "7657"     "741172"  
##  [43] "101954"   "8081245"  "1129"     "364197"   "571988"   "3016"    
##  [49] "55874"    "3642833"  "4883637"  "66459"    "311691"   "7138398" 
##  [55] "69246"    "210857"   "62791"    "1617496"  "1135625"  "1420030" 
##  [61] "11679382" "69827"    "19092"    "133912"   "164017"   "76353"   
##  [67] "2055521"  "888517"   "106509"   "3495608"  "1855917"  "53992"   
##  [73] "2597797"  "517344"   "913663"   "1053513"  "745723"   "36175751"
##  [79] "90661"    "726651"   "347262"   "369552"   "1183102"  "369268"  
##  [85] "18222"    "60851"    "152719"   "47114"    "6987487"  "80806"   
##  [91] "43153"    "40392"    "95453"    "343293"   "56581"    "4632988" 
##  [97] "18645"    "17617"    "1006934"  "653437"   "5595526"  "149603"  
## [103] "11142"    "1525912"  "692023"   "5471781"  "229187"   "8829733" 
## [109] "283235"   "10349"    "355047"   "562376"   "109027"   "7811045" 
## [115] "991944"   "2108261"  "120428"   "645628"   "3600844"  "55988"   
## [121] "524063"   "33016"    "1845947"  "189726"   "926185"   "839468"  
## [127] "337516"   "11113475" "83167"    "45880758" "1653959"  "32016901"
## [133] "30074"    "1441270"  "212434"   "22154"    "493614"   "14585953"
## [139] "1934817"  "12480330" "23979"    "6875584"  "53967"    "78527157"
## [145] "5119723"  "7215888"  "136957"   "99111"    "697177"   "8786200" 
## [151] "49385614" "45165530" "252607"   "9876211"  "72397050" "145099"  
## [157] "474733"   "96271"    "469096"   "5018813"  "9561190"  "676415"  
## [163] "39045"    "522061"   "19236"    "6463871"  "14543633" "1160316" 
## [169] "2893796"  "17997"    "47258"    "4177907"  "54749"    "1166912" 
## [175] "9029"     "2803634"  "2585421"  "290206"   "14619606" "2278"    
## [181] "4765820"  "132302"   "911079"   "99881"    "9641860"  "1688597" 
## [187] "265786"   "10563163" "116457"   "2048308"  "636777"   "3327774" 
## [193] "14251685" "941816"   "5251060"  "8063"     "1146669"  "79214"   
## [199] "93337766" "307665"   "135851"   "122293"   "554997"   "1516689" 
## [205] "13765664" "11345668" "73732"    "536028"   "66938"    "684661"  
## [211] "1241751"  "8317566"  "1503250"  "9406795"  "3303247"  "1998207" 
## [217] "1399377"  "5726329"  "12561"    "14126"    "954"      "5615380" 
## [223] "6264461"  "60639"    "50368"    "229596"   "664093"   "80632"   
## [229] "18352"    "3582981"  "3068910"  "11504255" "9076"     "5093547" 
## [235] "1861221"  "1585101"  "826278"   "3933462"  "3427748"  "1495340" 
## [241] "29212"    "119089"   "10757436" "90254"    "368938"   "16043"   
## [247] "1423661"  "35983"    "2854748"  "1232537"  "408199"   "1138977" 
## [253] "639657"   "86639"    "17675"    "1657143"  "45323"    "1250189" 
## [259] "7842916"  "43681931" "3126042"  "5274298"  "5819916"  "2542"    
## [265] "20307"    "499444"   "7661574"  "109208"   "9887725"  "205899"  
## [271] "3719191"  "20393"    "23425827" "57287"    "9965492"  "472154"  
## [277] "9352"     "761738"   "3777078"  "773"      "433758"   "36896"   
## [283] "70519"    "154746"   "12893342" "11172"    "370294"   "389735"  
## [289] "491725"   "5894890"  "14834"    "1313766"  "414544"   "2778683" 
## [295] "6215399"  "3633778"  "4663"     "2926"     "8719"     "1564528" 
## [301] "8968317"  "35207"    "446659"   "103712"   "829165"   "89294"   
## [307] "3295954"  "269126"   "1064462"  "92126"    "8737"     "252157"  
## [313] "1590111"  "11408"    "1340087"  "86818"    "3826310"  "92115"   
## [319] "380287"   "147784"   "10703170" "17792"    "20452"    "3784943" 
## [325] "51478"    "83773"    "513914"   "4778"     "252627"   "165395"  
## [331] "295195"   "3389254"  "323605"   "5906526"  "5417723"  "10567442"
## [337] "20396"    "6705619"  "11024366" "111781"   "718"      "1632732" 
## [343] "58541265" "5894298"  "3242461"  "219814"   "11402"    "7690631" 
## [349] "2538329"  "8165210"  "1310038"  "5104706"  "7948915"  "228438"  
## [355] "1404946"  "5022474"  "51448"    "8402552"  "160219"   "146061"  
## [361] "17304"    "228766"   "2213433"  "292299"   "198621"   "14028074"
## [367] "614951"   "56818"    "340828"   "10757106" "1486458"  "1036"    
## [373] "6506604"  "8439"     "98473"    "34988"    "921525"   "130237"  
## [379] "105187"   "1226311"  "1232061"  "288984"   "6692941"  "131375"  
## [385] "12132886" "612734"   "15812"    "4836626"  "44341"    "1367944" 
## [391] "439441"   "322465"   "5427193"  "5838243"  "1850297"  "1502071" 
## [397] "16127"    "9329651"  "392397"   "239878"   "777028"   "4814973" 
## [403] "247204"   "81988"    "398744"   "64731"
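
One detail in the output deserves a check. The raw values "104.79" and "111.35" shown by glimpse() end in a two-digit group, which is unusual for a thousands separator. If the source dropped a trailing zero (an assumption that should be verified against the fan website), those idols have 104,790 and 111,350 followers rather than 10,479 and 11,135. A hedged sketch of a parser that pads the last group under that assumption; parse_followers is a hypothetical helper, not part of the original analysis:

```r
# Hypothetical helper: assumes '.' is a thousands separator and that a short
# final group lost its trailing zero(s)
parse_followers <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) > 1) {
      last <- p[length(p)]
      # pad the final group on the right to three digits
      p[length(p)] <- paste0(last, strrep("0", max(0, 3 - nchar(last))))
    }
    as.numeric(paste(p, collapse = ""))
  }, numeric(1))
}

raw <- c("335.439", "104.79", "5.519.743", "954")
parsed <- parse_followers(raw)
parsed
```

Values without any separator, such as "954", are left unchanged.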

Next step, we can change some data types.

# convert the date strings (day/month/year) to Date type
kpop_idol$Date.of.Birth <- dmy(kpop_idol$Date.of.Birth)
kpop_idol$Debut <- dmy(kpop_idol$Debut)

# convert the Followers column to numeric
kpop_idol$Followers <- as.numeric(kpop_idol$Followers)

We can derive additional columns from the existing data. For example, we can compute each idol’s age and career length in years from the birth date and debut date.

library(eeptools)

## Warning: package 'eeptools' was built under R version 4.2.3

kpop_idol$age <- floor(age_calc(kpop_idol$Date.of.Birth, units = "years"))

kpop_idol$year.career <- floor(age_calc(kpop_idol$Debut, units = "years"))

head(kpop_idol)

We can see that the columns have been added successfully.
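
By default, age_calc() measures up to the current date, so the age and year.career values depend on when the notebook is run. A minimal base R sketch that counts completed years up to a fixed reference date instead; years_since is a hypothetical helper, and using the report date as the reference is an assumption:

```r
# Hypothetical helper: completed years between `start` and `ref`
years_since <- function(start, ref) {
  s <- as.POSIXlt(start)
  r <- as.POSIXlt(ref)
  # TRUE when the anniversary has not yet occurred in the reference year
  not_yet <- (r$mon < s$mon) | (r$mon == s$mon & r$mday < s$mday)
  (r$year - s$year) - as.integer(not_yet)
}

ref_date <- as.Date("2023-05-30")                       # assumed reference date
birth    <- as.Date("28/08/1992", format = "%d/%m/%Y")  # first birth date shown by glimpse()
debut    <- as.Date("31/10/2015", format = "%d/%m/%Y")  # first debut date shown by glimpse()

age_years    <- years_since(birth, ref_date)  # completed years of age
career_years <- years_since(debut, ref_date)  # completed years since debut
c(age = age_years, career = career_years)
```

Results may differ from age_calc() in edge cases such as birthdays on 29 February.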

There are two gender columns, which is redundant because we only need one. We can also convert the remaining gender column to a factor (categorical type).

kpop_idol <- kpop_idol %>% select(-c(Gender.x))

kpop_idol$Gender.y <- as.factor(kpop_idol$Gender.y)
head(kpop_idol)

OK! The data is ready now. We can go to the next process.

EDA (Exploratory Data Analysis)

Top 10 idols by followers

kpop_idol %>% 
  arrange(desc(Followers)) %>% 
  head(10)

I think it’s interesting to analyse the 100 most-followed idols separately from the full data.

top_100 <- kpop_idol %>% 
  arrange(desc(Followers)) %>% 
  head(100)

Let’s see the gender proportion of the idols

table(kpop_idol$Gender.y)

## 
##   F   M 
## 239 167

We can see that there are 239 female idols and 167 male idols, so female idols outnumber male idols in our data.

table(top_100$Gender.y)

## 
##  F  M 
## 41 59

However, men outnumber women among the 100 idols with the most followers.
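
Because the two groups differ in size, shares are easier to compare than raw counts. A small sketch using the counts printed above:

```r
# Counts copied from the table() output above
all_counts    <- c(F = 239, M = 167)
top100_counts <- c(F = 41,  M = 59)

# Convert counts to shares of each group
share_all    <- round(prop.table(all_counts), 3)
share_top100 <- round(prop.table(top100_counts), 3)

rbind(all = share_all, top_100 = share_top100)
```

Women make up about 59% of all idols in the data but 41% of the top 100.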

Let’s look at the distribution of followers in the data.

boxplot(kpop_idol$Followers)

The data is right-skewed, with many outliers on the upper side of the follower counts. Let’s look at the top 100 idols.

boxplot(top_100$Followers)

It’s still right-skewed, but the interquartile range is wider now.
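
When a few very large values form a long upper tail, the distribution is right-skewed and the mean sits well above the median. A small sketch using the first twelve follower counts printed earlier (a subset chosen only for illustration):

```r
# First twelve values from the printed Followers vector
x <- c(335439, 12342, 10479, 11135, 5519743, 17749,
       30965, 1857972, 718784, 541344, 442087, 313078)

mean_x   <- mean(x)    # pulled upward by the largest accounts
median_x <- median(x)  # robust to the upper tail
c(mean = mean_x, median = median_x)

# On a log10 scale the values are spread far more evenly
summary(log10(x))
```

Plotting boxplot(log10(kpop_idol$Followers)) is one possible way to make the bulk of the distribution visible.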

Let’s see the median number of followers by gender. We use the median to reduce the effect of outliers.

kpop_idol %>% 
  aggregate(Followers ~ Gender.y,
            FUN = median)

top_100 %>% 
  aggregate(Followers ~ Gender.y,
            FUN = median)

Let’s see the average number of followers per group among the top 100 idols.

top_100 %>%  aggregate(Followers ~ Group,
                     FUN = mean) %>% 
  arrange(desc(Followers))

Let’s try using median

top_100 %>%  aggregate(Followers ~ Group,
                     FUN = median) %>% 
  arrange(desc(Followers))

Let’s look at the correlation between every pair of numeric columns. The correlation coefficient measures how strongly two columns move together linearly. It ranges from -1 to 1: -1 means a strong negative (inverse) linear relationship, 1 means a strong positive linear relationship, and 0 means no linear relationship.

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggcorr(data = kpop_idol, label = T)

## Warning in ggcorr(data = kpop_idol, label = T): data in column(s) 'Stage.Name',
## 'Group', 'ig_name', 'Full.Name', 'Korean.Name', 'K.Stage.Name', 'Date.of.Birth',
## 'Debut', 'Company', 'Country', 'Second.Country', 'Birthplace', 'Other.Group',
## 'Former.Group', 'Gender.y' are not numeric and were ignored

We can see that the ‘Followers’ column has no strong linear correlation with any other numeric column in our data. Height and weight, on the other hand, are highly correlated.
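
Height and Weight contain many NA values, so the treatment of missing pairs matters when computing a correlation directly. A minimal sketch with base R’s cor(), using the height and weight values of the first twelve rows shown by glimpse():

```r
# Height/weight of the first twelve rows shown by glimpse()
height <- c(177, NA, NA, 166, 167, 167, NA, 163, NA, NA, 183, 178)
weight <- c(63, NA, NA, 47, NA, 49, NA, 47, NA, NA, 61, 60)

# Default behaviour: any NA makes the result NA
r_default <- cor(height, weight)

# Use only the rows where both values are present
r_complete <- cor(height, weight, use = "pairwise.complete.obs")
n_pairs    <- sum(complete.cases(height, weight))

c(default = r_default, complete = r_complete, pairs = n_pairs)
```

With only a handful of complete pairs, a coefficient like this is a rough indication rather than a firm estimate.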

Let’s see the average age of the idol

mean(kpop_idol$age)

## [1] 27.42118

mean(top_100$age)

## [1] 27.02

The top 100 idols are slightly younger on average than all idols in the data.

mean(kpop_idol$year.career)

## [1] 8.204433

mean(top_100$year.career)

## [1] 8.34

The top 100 idols have a slightly longer average career than all idols in the data.

boxplot(kpop_idol$year.career)

boxplot(top_100$year.career)

We can see some outliers in the career length of the top 100 idols.

mean(kpop_idol$Height, na.rm = TRUE)

## [1] 170.5

Let’s analyse the idols’ birthplaces. Before we proceed, we need to fill the blank values with ‘Unknown’.

kpop_idol <- kpop_idol %>%
  mutate(Birthplace = ifelse(Birthplace == "", "Unknown", Birthplace))

kpop_idol %>% 
  group_by(Birthplace) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))

Seoul is the most common birthplace.

Let’s see it in the top 100 idols data.

top_100 <- top_100 %>%
  mutate(Birthplace = ifelse(Birthplace == "", "Unknown", Birthplace))

top_100 %>% 
  group_by(Birthplace) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count))

Seoul is also the most common birthplace among the top 100 idols.

Let’s see whether the idols with the most followers are also from Seoul.

kpop_idol %>%  aggregate(Followers ~ Birthplace,
                     FUN = median) %>% 
  arrange(desc(Followers))

ggplot(kpop_idol, aes(x= Followers, y=reorder(Birthplace, Followers) ))+
  geom_boxplot()

Data Visualization

Tooltip Text Preparation

library(glue)
top_100 <- top_100 %>% 
  mutate(tooltip1 = glue(
    "{Full.Name}, {age} years
    Followers: {Followers}
    Group: {Group}
    Born: {Birthplace},
    Year Career: {year.career} years"
  ))

options(scipen = 999)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

plot1 <- ggplot(top_100[1:10,], aes(x = Followers, y = reorder(Stage.Name, Followers), text = tooltip1)) +
  geom_col(aes(fill = Followers), show.legend = F) + # fill the bars by follower count
  geom_label(mapping = aes(label = comma(Followers))) +
  labs(title = "Top 10 K-pop Idols Based on Instagram Followers",
       subtitle = "Based on Instagram Followers",
       caption = "Source: maling pangsit dot com",
       x = "Total Followers",
       y = NULL) +
  scale_fill_gradient(low = "ivory", high = "pink") +
  scale_x_continuous(limits = c(0,100000000),
                     labels = scales::label_number(scale_cut = scales::cut_short_scale()))

ggplotly(plot1, tooltip = "text")

## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomLabel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues

Let’s compare the idols’ heights by gender.

plot2 <- ggplot(kpop_idol, aes(x = Gender.y, y = Height))+
  geom_boxplot(fill = "lightblue", col = "black")+ 
  geom_jitter(col = "purple")+
  labs(title = "Idol's Height Based on Gender",
       x = "Gender",
       y = "Height (in cm)")

ggplotly(plot2, tooltip = "y")

## Warning: Removed 158 rows containing non-finite values (`stat_boxplot()`).

## Warning: The following aesthetics were dropped during statistical transformation:
## y_plotlyDomain
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

plot3 <- ggplot(kpop_idol, aes(x = Gender.y, y = Weight))+
  geom_boxplot(outlier.shape = NA, fill = "lightblue", col = "black")+
  geom_jitter(col = "purple")+
  labs(title = "Idol's Weight Based on Gender",
       x = "Gender",
       y = "Weight (in kg)")

ggplotly(plot3, tooltip = "y")

## Warning: Removed 231 rows containing non-finite values (`stat_boxplot()`).

## Warning: The following aesthetics were dropped during statistical transformation:
## y_plotlyDomain
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

plot4 <- ggplot(kpop_idol, aes(x = year.career, y = Followers))+
  geom_point(col = "purple")+
  scale_y_continuous(limits = c(0,100000000),
                     labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
  labs(title = "Idol's Followers Based on Year Career",
       x = "Year Career",
       y = NULL)

ggplotly(plot4)
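
Because follower counts are heavily skewed, a rank-based (Spearman) correlation or a log-transformed follower count may describe the relationship with career length better than the default Pearson coefficient. A minimal sketch on toy values (hypothetical, not from the dataset):

```r
# Toy data; values are hypothetical
toy <- data.frame(
  year.career = c(2, 4, 5, 7, 9, 12),
  Followers   = c(12000, 900000, 35000, 50000000, 400000, 2500000)
)

# Pearson on raw counts is dominated by the largest account
pearson <- cor(toy$year.career, toy$Followers)

# Spearman uses ranks, so extreme values carry no extra weight
spearman <- cor(toy$year.career, toy$Followers, method = "spearman")

# Pearson after a log10 transform of the follower counts
pearson_log <- cor(toy$year.career, log10(toy$Followers))

round(c(pearson = pearson, spearman = spearman, pearson_log = pearson_log), 3)
```

In the scatter plot itself, ggplot2’s scale_y_log10() could be used in place of the linear y scale for the same reason.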

Conclusion

Female idols outnumber male idols in our data.
Among the 100 idols with the most followers, men outnumber women.
Male idols have more followers (based on the median).
An idol’s follower count shows little linear correlation with height, weight, or career length.
Seoul is the most common birthplace among the top 100 idols.

Kpop Idols Analysis

Faisal Amir Maz

May 30, 2023