DO NOT TAKE THIS ANALYSIS SERIOUSLY!! It was made for fun.
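kpop_idols dataset
# Packages assumed by the code below: dplyr (pipeline verbs) and lubridate (month())
library(dplyr)
library(lubridate)
# Full Data
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0
kpop_idols <- read.csv("kpop_idols.csv", stringsAsFactors = TRUE)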
# Structure and summary of data in each column
str(kpop_idols)
## 'data.frame': 1310 obs. of 10 variables:
## $ Stage.Name : Factor w/ 1135 levels "A.M","Ace","Aeji",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Full.Name : Factor w/ 1252 levels "","Abe Haruno",..: 1079 272 684 702 155 78 407 861 498 205 ...
## $ Korean.Name : Factor w/ 1242 levels "","강가영","강경원",..: 521 920 72 794 44 1047 21 798 298 1202 ...
## $ K..Stage.Name: Factor w/ 1083 levels "","가가","가린",..: 508 507 495 490 475 482 491 509 484 481 ...
## $ Date.of.Birth: Factor w/ 1181 levels "1977-12-31","1980-12-25",..: 670 276 928 921 1059 195 630 88 202 1006 ...
## $ Group : Factor w/ 209 levels "","(G)I-DLE",..: 119 196 93 130 77 62 146 1 148 72 ...
## $ Country : Factor w/ 12 levels "Australia","Canada",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ Birthplace : Factor w/ 136 levels "","Andong","Ansan",..: 1 1 25 124 132 105 97 28 1 1 ...
## $ Other.Group : Factor w/ 56 levels "","2YOON","3RACHA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 1 1 1 1 ...
summary(kpop_idols)
## Stage.Name Full.Name Korean.Name K..Stage.Name
## Dawon : 4 : 6 : 6 수빈 : 5
## Hayoung : 4 Kim Chaewon : 3 김동현 : 3 유진 : 5
## Jinwoo : 4 Kim Donghyun: 3 김민석 : 3 지수 : 5
## Jinyoung: 4 Kim Jiwon : 3 김소희 : 3 다원 : 4
## Jisoo : 4 Kim Minseok : 3 김지원 : 3 민재 : 4
## Minhyuk : 4 Kim Sohee : 3 김채원 : 3 민혁 : 4
## (Other) :1286 (Other) :1289 (Other):1289 (Other):1283
## Date.of.Birth Group Country Birthplace
## 1994-01-18: 3 : 91 South Korea:1204 :621
## 1994-12-24: 3 NCT : 18 China : 39 Seoul :188
## 1995-01-23: 3 14U : 14 Japan : 27 Busan : 60
## 1996-11-06: 3 Cosmic Girls: 13 USA : 14 Gwangju: 35
## 1996-11-21: 3 Seventeen : 13 Taiwan : 6 Incheon: 29
## 1998-01-16: 3 IZ*ONE : 12 Thailand : 6 Daegu : 27
## (Other) :1292 (Other) :1149 (Other) : 14 (Other):350
## Other.Group Gender
## :1188 F:634
## Super Junior-M: 6 M:676
## Loona 1/3 : 5
## NCT Dream : 5
## 9MUSES A : 4
## Loona yyxy : 4
## (Other) : 98
kpop_idols_cleaned <- kpop_idols %>%
  mutate(Stage.Name = as.character(Stage.Name),
         Full.Name = as.character(Full.Name),
         Date.of.Birth = as.Date(Date.of.Birth)) %>%
  select(Stage.Name, Full.Name, Date.of.Birth, Group, Country, Gender)
# Blank cells in the Group column denote solo artists, so I replace them with "Solo"
kpop_idols_cleaned$Group <- as.factor(if_else(kpop_idols_cleaned$Group == "", "Solo", as.character(kpop_idols_cleaned$Group)))
kpop_idols_cleaned
summary(kpop_idols_cleaned)
## Stage.Name Full.Name Date.of.Birth Group
## Length:1310 Length:1310 Min. :1977-12-31 Solo : 91
## Class :character Class :character 1st Qu.:1993-02-12 NCT : 18
## Mode :character Mode :character Median :1996-05-30 14U : 14
## Mean :1996-01-13 Cosmic Girls: 13
## 3rd Qu.:1999-05-14 Seventeen : 13
## Max. :2005-08-22 IZ*ONE : 12
## (Other) :1149
## Country Gender
## South Korea:1204 F:634
## China : 39 M:676
## Japan : 27
## USA : 14
## Taiwan : 6
## Thailand : 6
## (Other) : 14
Most kpop idols were born in January, and the fewest were born in June.
# most common month
kpop_idols_month <- kpop_idols_cleaned %>%
mutate(month = month(Date.of.Birth)) %>%
count(month) %>%
arrange(desc(n)) %>%
mutate(percent = n/sum(n)*100)
kpop_idols_month
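Whether the month counts differ by more than chance can be checked roughly with a chi-squared goodness-of-fit test. The sketch below is not part of the original analysis: `birth_month_test` is a hypothetical helper name, and it assumes that, with no real effect, births would be spread in proportion to the number of days in each month (February counted as 28.25 days).

```r
# Hypothetical helper (not from the original analysis): chi-squared
# goodness-of-fit test of birth months against the share of days per month.
birth_month_test <- function(dates) {
  months <- as.integer(format(dates, "%m"))
  # Keep all 12 months, even ones with zero births
  observed <- table(factor(months, levels = 1:12))
  # Assumption: expected share of births is proportional to month length
  days_in_month <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  chisq.test(observed, p = days_in_month / sum(days_in_month))
}

# Usage with the cleaned data from above:
# birth_month_test(kpop_idols_cleaned$Date.of.Birth)
```

A large p-value from this test would be consistent with the months not really differing.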
In my view, the table shows no significant difference among the months. So, let’s be more specific and find the most common month-day combinations.
kpop_idols_md <- kpop_idols_cleaned %>%
mutate(md = format(Date.of.Birth, "%m-%d"))
count(kpop_idols_md, md) %>%
arrange(desc(n))
Many idols were born on March 20. Let’s find out who was born on this date.
kpop_idols_md %>%
filter(md == "03-20") %>%
arrange(Date.of.Birth)
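A busy date is not automatically surprising: with 1310 idols spread over a year, some days will collect several birthdays by chance alone. As a rough sketch, assuming birthdays are uniform over the year and using a Poisson approximation, the hypothetical helper below estimates how many calendar days would be expected to have at least `k` idols.

```r
# Hypothetical helper (not from the original analysis).
# Assumption: birthdays are uniform over the year, so the count on any one
# day is approximately Poisson with mean n_people / 365.25.
expected_days_with_at_least <- function(n_people, k) {
  lambda <- n_people / 365.25
  # P(count >= k) for one day, scaled to the number of days in a year
  365.25 * ppois(k - 1, lambda, lower.tail = FALSE)
}

# Usage: expected number of days with at least 8 of the 1310 idols
expected_days_with_at_least(1310, 8)
```

If the result for the top count is well below 1, that date is busier than chance alone would suggest.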
Now, let’s prepare the data to see whether it covers every month-day pair of the year (including February 29).
all_ymd <- seq(as.Date("1992-1-1"), as.Date("1992-12-31"), by = "1 day")
all_md <- as.factor(format(all_ymd, "%m-%d"))
kpop_md <- as.factor(unique(kpop_idols_md$md))
all_md_df_1 <- as.data.frame(all_md)
kpop_md_df_1 <- as.data.frame(kpop_md)
all_md_df_2 <- all_md_df_1 %>%
mutate(md=all_md)
kpop_md_df_2 <- kpop_md_df_1 %>%
mutate(md=kpop_md)
str(all_md_df_2)
## 'data.frame': 366 obs. of 2 variables:
## $ all_md: Factor w/ 366 levels "01-01","01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ md : Factor w/ 366 levels "01-01","01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
str(kpop_md_df_2)
## 'data.frame': 355 obs. of 2 variables:
## $ kpop_md: Factor w/ 355 levels "01-01","01-02",..: 355 231 289 261 50 143 230 147 187 195 ...
## $ md : Factor w/ 355 levels "01-01","01-02",..: 355 231 289 261 50 143 230 147 187 195 ...
No, it does not: only 355 of the 366 possible month-day pairs appear in the data. So you would have to be born on one of the following dates to be unique in the kpop industry.
no_kpop <- all_md_df_2 %>%
anti_join(kpop_md_df_2, by="md")
## Warning: Column `md` joining factors with different levels, coercing to
## character vector
no_kpop
According to the dataset, no kpop idol was born on any of the above dates. If one of them is your birthday, you might add diversity to the kpop industry!
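The same check can be sketched more compactly with `setdiff()` on plain character vectors, which also avoids the factor-coercion warning from the join. `missing_month_days` is a hypothetical helper name, and 1992 is used only because it is a leap year, as in the code above.

```r
# Hypothetical helper (not from the original analysis): month-day pairs
# that never occur in a vector of dates.
missing_month_days <- function(dates) {
  # Every month-day pair of a leap year, as "mm-dd" strings
  all_md <- format(
    seq(as.Date("1992-01-01"), as.Date("1992-12-31"), by = "1 day"),
    "%m-%d"
  )
  # Pairs present in the calendar but absent from the data
  setdiff(all_md, format(dates, "%m-%d"))
}

# Usage with the cleaned data from above:
# missing_month_days(kpop_idols_cleaned$Date.of.Birth)
```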
Data Source : Kaggle
THANK YOU :)