I’ve recently been getting into the podcast “The Indicator by Planet Money”, which is a breakout series from NPR’s Planet Money. Each weekday the hosts, Stacey Vanek Smith and Cardiff Garcia, choose an interesting number from the news and spend about 8 minutes talking about it in more detail.

This past Wednesday (April 4, 2018), an episode titled “Stop, Collaborate, and Listen” discussed the emergence of song collaborations and their success on the Billboard Hot 100 chart. Most notably, Vanek Smith and Garcia state that collaborations today make up about 35% of the list, whereas in the 90s their share was only 5%.

Interested to see if I could replicate this result and find some more insight into this phenomenon, I sought out data on the historic weekly Billboard Hot 100 charts and analyzed it with R. The data was compiled by Michael Kling, who wrote a Python scraper program to obtain the information. It isn’t the cleanest dataset, but it works for our purpose.

Let’s take a look at the data we’ll be using.

#load in packages
library(tidyverse)
library(lubridate)

#import Billboard dataset
data <- read.csv("all_billboard_data.txt", sep = "|")
data$chart.date <- ymd(data$chart.date)
data$year <- year(data$chart.date)
head(data,n=5)
##   pos last.week peak weeks.on.chart               title
## 1   1         1    1             15        Uptown Funk!
## 2   2         2    2             20   Thinking Out Loud
## 3   3         6    3              7 Love Me Like You Do
## 4   4         5    4              6               Sugar
## 5   5         3    2             28   Take Me To Church
##                             artist chart.entry.date entry.position
## 1 MARK RONSON featuring BRUNO MARS            41972             65
## 2                       ED SHEERAN            41937             69
## 3                   ELLIE GOULDING            42028             45
## 4                         MAROON 5            42035              8
## 5                           HOZIER            41881             96
##   overall.peak overall.weeks.on.chart chart.date year
## 1            1                     15 2015-03-07 2015
## 2            2                     20 2015-03-07 2015
## 3            3                      7 2015-03-07 2015
## 4            4                      6 2015-03-07 2015
## 5            2                     28 2015-03-07 2015
range(data$chart.date)
## [1] "1940-07-20" "2015-03-07"

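The variables in the dataset are pos, last.week, peak, weeks.on.chart, title, artist, chart.entry.date, entry.position, overall.peak, overall.weeks.on.chart and chart.date, plus the year column we derived from chart.date.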

We also see the range of dates in the dataset, 07/20/1940 to 03/07/2015.
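
The chart.entry.date column in the preview above looks like a day count rather than a date. Here is a minimal sketch of converting it, assuming (the dataset itself does not say so) that the numbers are spreadsheet-style serial dates counted from 1899-12-30. The values are copied from the five preview rows, and the variable names are only illustrative.

```r
# Values copied from the first five rows printed above
entry_serial   <- c(41972, 41937, 42028, 42035, 41881)
weeks_on_chart <- c(15, 20, 7, 6, 28)
chart_date     <- as.Date("2015-03-07")

# Assumption: serial day counts with a spreadsheet-style origin of 1899-12-30
entry_date <- as.Date(entry_serial, origin = "1899-12-30")
entry_date

# Sanity check: a song in its n-th week should have entered n - 1 weeks earlier
expected_entry <- chart_date - 7 * (weeks_on_chart - 1)
all(entry_date == expected_entry)
```

For these five rows the converted dates line up with chart.date and weeks.on.chart, which supports the assumed origin, but it would be worth checking against the full file before relying on it.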

The artist variable is the one we want to use to determine whether the song is a collaboration or a solo. When scrolling through the data, it’s evident that there’s not a standard means of expressing collaboration. Multiple artists are listed in artist, and these are primarily separated through key words (“featuring”, “feat.”, “with”, “and”) and symbols (“&”, “/”). Let’s create a new data frame with only songs performed by more than one artist.

#create a subset with collaborative efforts
#the keywords are joined into one regular expression so every artist is checked against all of them
data$collab <- ifelse(str_detect(data$artist, "feat|&|with|and|/"), "Collaboration","Solo")
collab <- data %>%
  filter(collab == "Collaboration")
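
One thing to keep in mind with this kind of keyword matching is that stringr functions are vectorised over both the string and the pattern, so a vector of keywords gets paired up with the rows instead of being tried in full on each row. Here is a minimal sketch of testing every keyword against every artist by joining the keywords into a single regular expression. The handful of artist strings and the variable names are only for illustration.

```r
library(stringr)

# A handful of example artist strings, for illustration only
artists <- c("MARK RONSON featuring BRUNO MARS",
             "ED SHEERAN",
             "BING CROSBY & THE ANDREWS SISTERS",
             "BRAND NEW")

# Join the keywords with "|" so each string is tested against every keyword
keywords <- c("feat", "&", "with", "and", "/")
pattern  <- paste(keywords, collapse = "|")

is_collab <- str_detect(artists, pattern)
data.frame(artist = artists,
           collab = ifelse(is_collab, "Collaboration", "Solo"))
```

The match is case-sensitive, so the keyword "and" does not fire on the capitalised "BRAND". That relies on the dataset writing artist names in capitals and connecting words in lowercase, which is what the preview suggests but is worth confirming on the full data.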

Now that we’ve identified collaborative efforts within the dataset, let’s look at the Billboard Hot 100 chart itself.

chart_size <- data %>%
  group_by(chart.date) %>%
  summarize(size=n())

ggplot(chart_size,aes(x=chart.date,y=size)) + 
  geom_bar(stat = "identity") +
  xlab("Year") + ylab("Chart Size") +
  ggtitle("Billboard Chart Size Over Time")

When the data first became available, Billboard only included the top 10 songs in its rankings. Over time this number varied, but it has remained at 100 since the mid 80’s. This is particularly important to consider if we analyze metrics such as a song’s duration on the chart, which will be understated in the first 15 years of observation, or a song’s peak position on the chart, which will be overstated in the first 15 years.
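
One way to keep the changing chart size from distorting such metrics is to restrict the analysis to weeks in which the chart had its full length. This is a minimal sketch on toy data, not something done in the analysis here. The toy data frame and the `full_size` value are illustrative assumptions (on the real data it would be 100).

```r
library(dplyr)

# Toy chart: an early week with 3 positions and a later week with 5
toy <- data.frame(
  chart.date = as.Date(c(rep("1945-01-06", 3), rep("1990-01-06", 5))),
  pos = c(1:3, 1:5)
)

# Stands in for 100 on the real chart
full_size <- 5

# Keep only the weeks whose chart has the full number of positions
full_weeks <- toy %>%
  group_by(chart.date) %>%
  filter(n() == full_size) %>%
  ungroup()

full_weeks
```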

Now we’ll look at the total occurrences of collaborations on the Billboard Hot 100 over time.

#summarize data by year
yearly <- collab %>%
  group_by(year) %>%
  summarize(tot=n())

#plot yearly collaboration occurrences 
ggplot(yearly,aes(x=year,y=tot)) +
  geom_bar(stat = "identity") +
  geom_smooth() +
  xlab("Year") + ylab("Number of Songs") +
  ggtitle("Frequency of Song Collaborations on the Billboard Hot 100 List")

Indeed we see that the number of songs with multiple artists on the Billboard Hot 100 chart has been growing over time, with a particularly sharp spike in the past two decades. It’s interesting to see the prominence of collaborations in the 1940’s and 50’s. If we look at the collaborations during this time we see that they were mostly vocals accompanied by orchestras. This isn’t how we think of collaborations today, which are typically both vocals, but nonetheless we can still consider it for our analysis.

artists <- data %>%
  filter(collab == "Collaboration",
         chart.date < "1956-01-01") %>%
  group_by(artist,title) %>%
  summarize(num_weeks=n()) %>%
  arrange(-num_weeks)
head(artists, n=10)
## # A tibble: 10 x 3
## # Groups: artist [10]
##    artist                              title                       num_we…
##    <fctr>                              <fctr>                        <int>
##  1 LES BROWN & HIS BAND OF RENOWN / D… Sentimental Journey              15
##  2 KAY KYSER & HIS ORCHESTRA / HARRY … Who Wouldn't Love You            12
##  3 BING CROSBY & THE ANDREWS SISTERS … Don't Fence Me In                11
##  4 GLEN GRAY & HIS ORCHESTRA / EUGENI… My Heart Tells Me (Should …      11
##  5 BING CROSBY & THE WILLIAMS BROTHER… Swinging On A Star               10
##  6 GLENN MILLER & HIS ORCHESTRA / RAY… Elmer's Tune                     10
##  7 GLENN MILLER & HIS ORCHESTRA / TEX… Chattanooga Choo Choo            10
##  8 HARRY JAMES & HIS MUSIC MAKERS / H… I Had The Craziest Dream         10
##  9 JOHNNY MERCER & THE PIED PIPERS / … On The Atchison, Topeka & …      10
## 10 TOMMY DORSEY & HIS ORCHESTRA / FRA… There Are Such Things            10

Let’s now look at the share of the Billboard Hot 100 that collaborations occupy.

#make data frame with chart totals for collabs and solos
compare <- data %>%
  group_by(year,collab) %>%
  summarize(tot=n())

#create a "percentage of collabs on chart per year" variable
compare <- spread(compare,collab,tot)
compare[is.na(compare)] <- 0
compare$pct <- 100*compare$Collaboration/(compare$Collaboration+compare$Solo)

#plot percentages over time
ggplot(compare,aes(x=year,y=pct)) + 
  geom_bar(stat = "identity") + 
  geom_smooth() +
  xlab("Year") + ylab("Percent") +
  ggtitle("Share of Song Collaborations on the Billboard Hot 100 List")
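
When computing a share like this, the denominator matters: the share of the chart is collaborations divided by all chart entries, while dividing by solo entries alone gives a collaboration-to-solo ratio, which is always larger. A small worked example with invented counts (the numbers below are not from the Billboard data):

```r
# Invented yearly counts, for illustration only
counts <- data.frame(year = c(1995, 2015),
                     Collaboration = c(10, 35),
                     Solo = c(90, 65))

# Share of all chart entries that are collaborations, in percent
counts$share_pct <- 100 * counts$Collaboration /
  (counts$Collaboration + counts$Solo)

# Collaborations relative to solo entries only, in percent
counts$ratio_pct <- 100 * counts$Collaboration / counts$Solo

counts
```

With 35 collaborations and 65 solos the share is 35%, while the ratio to solos is about 54%, so the two measures drift further apart as collaborations become more common.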

Off the bat, we see that collaborations in the 1940’s made up much of Billboard’s rankings. The data suggests that collaborations weren’t very popular from the 60’s to the 90’s, as they only made up about 1% of the list, or one song per week. However, from the 90’s to today there has been a pretty steady increase in collaborations featured on the Billboard Hot 100.

The Indicator reported that about 35% of the Billboard Hot 100 consisted of collaborations. This is quite a bit higher than these results, which end at about 9.5% in 2015. There are a couple of reasons why this might be the case:

The increase in song collaborations over the past two decades seems like a trend that will continue, especially if the actual results are close to what The Indicator reports. Perhaps this is indicative of changing preferences, but I can see how collaborations could be advantageous for the music industry as a whole.

Let’s look at how collaborations perform compared to solo performances.

#create data frame that removes multiple song observations, keeping info on the longest run it had on the chart
weeks <- data %>%
  group_by(artist,title,collab) %>%
  summarize(weeks_on_chart = max(overall.weeks.on.chart),
            year = max(year)) %>%
  filter(!is.na(weeks_on_chart))

#summarize the average number of weeks on the chart by collab type
weeks_avg <- weeks %>% group_by(collab) %>%
  summarize(avg_weeks=mean(weeks_on_chart))

#plot average weeks on chart  
ggplot(weeks_avg,aes(x=collab,y=avg_weeks)) + 
  geom_bar(stat = "identity") +
  ggtitle("Average Number of Weeks on the Billboard Hot 100 List") +
  xlab("") + ylab("")

#split the songs by type and run a two-sample t-test
collab_yes <- filter(weeks,collab == "Collaboration")
collab_no <- filter(weeks,collab == "Solo")
t <- t.test(collab_yes$weeks_on_chart,collab_no$weeks_on_chart)
t$data.name <- "Collaborations and Solo Performances"
t
## 
##  Welch Two Sample t-test
## 
## data:  Collaborations and Solo Performances
## t = 20.127, df = 3277.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.966218 3.606522
## sample estimates:
## mean of x mean of y 
##  13.57480  10.28843

One metric to measure a song’s performance is the number of weeks it has spent on the Billboard list. When we compare collaborations and solos, we see that collaborations remained on the Billboard Hot 100 list for an average of 13.57 weeks and solo performances for an average of 10.29 weeks. When we test for significance using a Welch two-sample t-test, we see that the difference is significant, with a t-statistic of 20.127 and a p-value below 2.2e-16.
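
For reference, the Welch t-statistic that t.test reports is the difference in group means divided by its estimated standard error. This is a small sketch reproducing it by hand on simulated data. The values are random draws, not the Billboard data, and the variable names are only illustrative.

```r
set.seed(1)

# Simulated weeks-on-chart values, not the Billboard data
x <- rpois(200, lambda = 13)
y <- rpois(300, lambda = 10)

# Welch statistic: difference in means over its estimated standard error
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t_manual <- (mean(x) - mean(y)) / se

# Compare with the statistic computed by t.test (Welch is its default)
t_builtin <- unname(t.test(x, y)$statistic)
all.equal(t_manual, t_builtin)
```

Because the standard error shrinks as the number of songs grows, a large t-statistic mostly reflects the sample size. The confidence interval in the output above (roughly 2.97 to 3.61 weeks) is the more direct measure of how much longer collaborations stayed on the chart.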

Collaborations are not only growing in popularity, they can also be an effective way to promote a song or artist. It would be interesting to examine this further, and there are several ways to do that: