The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales, radio play, and online streaming in the United States.

Every week, Billboard releases “The Hot 100” chart of songs that were trending on sales and airplay for that week. This dataset is a collection of all “The Hot 100” charts released since its inception in 1958.

We will look at detail data of Billboard Hot 100 data from Kaggle.com to check and visualize the data. Data taken as of March 14, 2021

Data Exploratory and Explanatory

Before we do the exploratory and explanatory data analysis, we will install all the library needed to support the data analysis. The libraries are ggplot2, lubridate, scales, dplyr, scales, skimr, readr.

After we install the libraries, we call the data and check all the detail of the data.

billboard <- read_csv("charts/charts.csv")
glimpse(billboard)
## Rows: 326,687
## Columns: 7
## $ date             <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958~
## $ rank             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ song             <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Har~
## $ artist           <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bob~
## $ `last-week`      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `peak-rank`      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ `weeks-on-board` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
dim(billboard)
## [1] 326687      7

After we check, the dimension of the data is 326,687 rows and 7 columns. We also check the data types for every column. There are 4 columns that we have to change the data types

  • rank to factor data type. The reason is there are possibilities that 1 song or artist can rank more than 1 time in the billboard hot 100 chart
  • song to factor data type. The reason is there are possibilities that 1 song can rank more than 1 time in the billboard hot 100 chart
  • artist to factor data type. The reason is there are possibilities that 1 artist can rank more than 1 time in the billboard hot 100 chart
  • weeks-on-board to factor data type. The reason is there are possibilities that 1 song or artist can rank more than 1 weeks in the billboard hot 100 chart

We also found that there are 1 column (last-week column) that probably that we will not use. We can drop this column.

# Drop column
billboard <- read_csv("charts/charts.csv")
billboard <-
  subset(billboard,
         select = c("date", "rank", "song", "artist",
                    "peak-rank", "weeks-on-board"))

# Add column

billboard$date  <- ymd(billboard$date)
billboard$wday  <- wday(billboard$date, label = T)
billboard$month <- month(billboard$date, label = T)
billboard$year  <- year(billboard$date)

# column set names

billboard <-
  setNames(
    billboard,
    c(
      "Date",
      "Rank",
      "Song",
      "Artist",
      "Peak_Rank",
      "Weeks_On_Board",
      "Day_Rank",
      "Month_Rank",
      "Year_Rank"
    )
  )

# Data type changes

billboard$Rank             <- as.factor(billboard$Rank)
billboard$Song             <- as.factor(billboard$Song)
billboard$Artist           <- as.factor(billboard$Artist)
billboard$Weeks_On_Board   <- as.factor(billboard$Weeks_On_Board)
billboard$Year_Rank        <- as.factor(billboard$Year_Rank)

After we do a little bit of transformation in the data, we skim once again to check the detail of the data

skim(billboard)
Data summary
Name billboard
Number of rows 326687
Number of columns 9
_______________________
Column type frequency:
Date 1
factor 7
numeric 1
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 1958-08-04 2021-03-13 1989-11-25 3267

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Rank 0 1 FALSE 100 18: 3269, 81: 3269, 5: 3268, 8: 3268
Song 0 1 FALSE 24253 Sta: 208, Ang: 205, Hea: 194, Hol: 189
Artist 0 1 FALSE 9994 Tay: 1005, Elt: 889, Mad: 857, Ken: 758
Weeks_On_Board 0 1 FALSE 87 1: 29251, 2: 26791, 3: 25143, 4: 23737
Day_Rank 0 1 TRUE 3 Sat: 308787, Mon: 17800, Sun: 100, Tue: 0
Month_Rank 0 1 TRUE 12 Aug: 28000, Oct: 27900, Dec: 27796, Jan: 27795
Year_Rank 0 1 FALSE 64 196: 5300, 197: 5300, 198: 5300, 198: 5300

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Peak_Rank 0 1 41.04 29.35 1 14 38 66 100 ▇▅▅▅▃

After we skim the data, everything is great. No missing value, all columns set, all the datatypes corrected. Now we will check the descriptive statisctic for the data.

summary(billboard)
##       Date                 Rank               Song       
##  Min.   :1958-08-04   18     :  3269   Stay     :   208  
##  1st Qu.:1974-03-30   81     :  3269   Angel    :   205  
##  Median :1989-11-25   5      :  3268   Heaven   :   194  
##  Mean   :1989-11-24   8      :  3268   Hold On  :   189  
##  3rd Qu.:2005-07-23   30     :  3268   I Like It:   188  
##  Max.   :2021-03-13   34     :  3268   You      :   184  
##                       (Other):307077   (Other)  :325519  
##            Artist         Peak_Rank      Weeks_On_Board   Day_Rank    
##  Taylor Swift :  1005   Min.   :  1.00   1      : 29251   Sun:   100  
##  Elton John   :   889   1st Qu.: 14.00   2      : 26791   Mon: 17800  
##  Madonna      :   857   Median : 38.00   3      : 25143   Tue:     0  
##  Kenny Chesney:   758   Mean   : 41.04   4      : 23737   Wed:     0  
##  Drake        :   735   3rd Qu.: 66.00   5      : 22348   Thu:     0  
##  Tim McGraw   :   731   Max.   :100.00   6      : 20912   Fri:     0  
##  (Other)      :321712                    (Other):178505   Sat:308787  
##    Month_Rank       Year_Rank     
##  Aug    : 28000   1966   :  5300  
##  Oct    : 27900   1972   :  5300  
##  Dec    : 27796   1983   :  5300  
##  Jan    : 27795   1988   :  5300  
##  Mar    : 27699   1994   :  5300  
##  May    : 27500   2000   :  5300  
##  (Other):159997   (Other):294887
head(billboard)

Visualization

Now let’s do the visualization. First, let’s check who are the Top 20 Artists who have been made the longest reign in the Billboard Chart Hot 100.

artist1 <- billboard %>%
  count(Artist) %>%
  arrange(desc(n)) %>%
  head(20)

ggplot(artist1, aes(y=reorder(Artist,n), x=n)) +
  geom_col(aes(fill = n), show.legend = F) +
  scale_fill_gradient(low = "#4ef2e2", high = "#2b3bed") +
  geom_text(aes(label = n), hjust = -0.3, size = 3.5) +
  labs(
    title = "Top 20 Artist With The Longest Reign ",
    subtitle = "Billboard Chart Hot 100",
    caption = "Source Data: Kaggle.com",
    x = "Weeks",
    y = ""
  ) +
  theme_bw()

Wow what a diverse list of Artists in the Top 20 longest weeks on Billboart chart Hot 100. However, Taylor Shift is the boss. She beats most of legendary Artists to be the longest reign artist in the Billboard Chart Hot 100 with a whooping total 1005 weeks. Crazy!

Let’s find out the Top 20 Artists with the most song entries in the Billboard Chart Hot 100

sum1 <- billboard %>%
  group_by(Artist, Song) %>%
  summarise(Peak_Rank = sum(Peak_Rank))

sum2 <- sum1 %>% group_by(Artist, Song) %>%
  count(Song)

sum3 <- sum2 %>% group_by(Artist) %>%
  summarise(jumlah_lagu = sum(n)) %>%
  arrange(desc(jumlah_lagu)) %>%
  head(20)

ggplot(sum3, aes(
  area = jumlah_lagu,
  label = reorder(Artist, jumlah_lagu),
  fill = jumlah_lagu
)) +
  geom_treemap() +
  geom_treemap_text(
    place = "centre",
    grow = T,
    reflow = T,
    colour =
      "yellow",
    fontface = "italic",
    min.size = 5
  ) +
  theme(legend.position = "bottom") +
  labs(
    title = "Top 20 Artist with The Most Song Entries",
    subtitle = "Billboard Chart Hot 100",
    caption = "Source: Kaggle.com",
    fill = "Number of Song Entries"
  )

Wow, Glee TV Show is indeed one of the best pop culture in our lifetime. The songs by its casts are the most entries in the Billboard chart Hot 100 with 183 songs. It is crazy!

Let’s check the Top 20 artists who have been reaching no 1 spot in Billboard hot 100 chart with the longest weeks of no 1 rank on the chart.

rank1 <- subset(billboard, subset = Rank == "1" & Peak_Rank == 1) %>%
  arrange(desc(Weeks_On_Board))

rank2 <- rank1 %>%
  group_by(Artist) %>%
  summarise(jumlah_Peak_Rank1 = sum(Peak_Rank)) %>%
  arrange(desc(jumlah_Peak_Rank1)) %>%
  filter(jumlah_Peak_Rank1 > 1) %>%
  head(20)

ggplot(rank2, aes(label = Artist, size = jumlah_Peak_Rank1)) +
  geom_text_wordcloud_area(aes(col=jumlah_Peak_Rank1)) +
  scale_size_area(max_size = 40) +
  theme_bw()

All the Artist in the the graphic are very iconic and our legendary diva Mariah Carey is the MVP. She is leading the pack as the most no 1 artist with the longest weeks reign on the no 1 rank. She even beats The Beatles, She is amazing!

Next lets’s check the Top 20 Songs which have been made the longest reign in the Billboard Chart Hot 10.

song1 <- billboard %>%
  group_by(Song, Artist) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(20)

ggplot(song1, aes(x = n, y = reorder(Song, n))) +
  geom_col(aes(fill = n), show.legend = F) +
  scale_fill_gradient(low = "firebrick", high = "dodgerblue4") +
  geom_text(aes(label = Artist),
            hjust = 1.1,
            size = 3.5,
            col = "white") +
  geom_text(aes(label = n),
            hjust = -0.2,
            size = 3.5,
            col = "black") +
  labs(
    title = "Top 20 Songs With The Longest Reign in Billboard Chart Hot 100",
    caption = "Source Data: Kaggle.com",
    x = "Weeks",
    y = "",
    fill = "Weeks"
  ) +
  theme_bw()

Wow it surprisingly Radioactive by Imagine Dragon that takes the crown as the longest reign song in Billboard Chart Hot 100. It is indeed a majestic song!

Now lets check the Top 20 songs that has been reaching no 1 spot in Billboard hot 100 chart with the longest weeks on no 1 rank.

rank3 <- subset(billboard, subset = Rank == "1" & Peak_Rank == 1) %>%
  arrange(desc(Weeks_On_Board))

rank4 <- rank3 %>%
  group_by(Song, Artist) %>%
  summarise(jumlah_Peak_Rank3 = sum(Peak_Rank)) %>%
  arrange(desc(jumlah_Peak_Rank3)) %>%
  head(20)


ggplot(rank4, aes(y = reorder(Song, jumlah_Peak_Rank3), x = jumlah_Peak_Rank3)) +
  geom_col(aes(fill = jumlah_Peak_Rank3), show.legend = F) +
  geom_text(aes(label = Artist),
            hjust = 1.1,
            col = "black",
            size = 3.5) +
  geom_text(
    aes(label = jumlah_Peak_Rank3),
    hjust = -0.3,
    col = "black",
    size = 3
  ) +
  scale_fill_gradient(low = "orange", high = "#e334e0") +
  labs(
    title = "Top 20 Song With The Longest Reign On No 1 Rank",
    subtitle = "Billboard Hot 100 chart",
    caption = "Source Data: Kaggle.com",
    x = "Frequency no 1 rank",
    y = ""
  ) +
  theme_bw()

Wow surprisingly “Old Town Road” by Lil Nas and Billy Ray Cyrus is on the top of the list. This song is very current and succesfully beats all massive songs in the list as the longest weeks in no 1 rank on the Billboard Chart Hot 100. Standing applause!