Spotify is one of the most popular music streaming services currently available, giving anyone access to millions of songs. We both really enjoy music, but unfortunately have no dreams of becoming famous musicians. However, we wanted to know what it might take to become a popular musician on Spotify. Using the information available through Spotify, we set out to find which factors (if any) make a song popular, so that we can help others make successful songs within specific genres. We also want to see which genre is the most produced, and what helps a song show up on multiple playlists. We have 23 variables, 12 of which are measurable variables, and over 32,000 songs to use to answer these questions.
Using R and RStudio, we will first clean and organize the data, spotify_songs.csv. Next, we will start simple and find out which genres are the most produced and how many duplicated songs there are. After that, we will compare a few variables to each other to look for correlation. This step will be tricky because correlation doesn’t always mean causation, so we want to avoid comparing variables that may not impact each other.
To further analyze the data, we will be using basic statistical techniques, such as mean, variance, correlation, and regression to address our problems. We hope these will be able to answer all our questions, but may use additional techniques as we go. We will use packages such as tidyverse to help clean the data, ggplot2 to help with visualizing the data, and a few other packages to help organize the data.
Our analysis can help a consumer, such as a recording artist or a music producer, decide how to approach creating a song and what factors to consider in order to maximize the popularity of their song.
# ggplot2 and the other core packages are all included in tidyverse, so it’s easiest to load that
library(tidyverse)
library(knitr)
library(broom) # turns lm output into tidy tibbles
library(pander)
library(DT)
Spotify Songs
The data comes from Spotify via the spotifyr package. Kaylin Pavlik gathered the original data to compare six genres of music to summarize what variables stand out in specific genres.
The original dataset contains 23 variables and 32,833 songs spanning 6 genres (EDM, Latin, Pop, R&B, Rap, & Rock). One peculiarity that we noticed in the original data set was that many of the observations within track_artist, track_name, playlist_name, and track_album_name contained unusual characters. For example, a couple of the observations had Beyoncé as the track_artist. This looks like an encoding error in how the data was imported (text stored as UTF-8 but read as Latin-1), and these observations are most likely supposed to read Beyoncé. Our challenge will be to fix these unusual characters.
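The unusual characters described above look like UTF-8 text that was decoded as Latin-1. Below is a minimal base R sketch of one way to reverse that, assuming this is really what happened to the affected values. The helper name fix_mojibake is ours for illustration and is not part of any package.

```r
# Hypothetical helper: undo "UTF-8 bytes read as Latin-1" mojibake.
fix_mojibake <- function(x) {
  # Re-encode to Latin-1 to recover the original raw bytes ...
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # ... then declare those bytes to be UTF-8, which is what they were all along.
  Encoding(bytes) <- "UTF-8"
  bytes
}

garbled <- "Beyonc\u00c3\u00a9"  # the garbled form seen in track_artist
fixed <- fix_mojibake(garbled)
fixed
```

This should only be applied to values known to be garbled: running it on text that is already correct would damage any accented characters it contains.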
Let’s take a look at the data:
spotify <- read_csv("spotify_songs.csv")
The data has 32833 rows and 23 columns.
datatable(head(spotify,10))
Looking at the initial data, we can see that the variable names don’t need adjusting, as they are already in an easy-to-read and organized format. We need to take a closer look at the data to see if anything needs to be fixed.
Variable.type <- lapply(spotify, class)
# One description per column, in the same order as the columns of spotify
Variable.desc <- c("Unique ID assigned to each song", "Song name", "Song artist",
"Song popularity (0-100) where higher is better", "Unique ID assigned to each album",
"Album name that song is on", "Date when album was released",
"Name of playlist that has song on it",
"Unique ID of playlist", "Genre of the playlist", "Subgenre of the playlist",
"How suitable a track is for dancing (0 is least danceable and 1 is most danceable)",
"Energy represents the measure of intensity and activity (range from 0 to 1)",
"Overall key of the track (0 = C, 1 = C#, 2 = D, etc. -1 = no key detected)",
"Loudness of a track in decibels (dB)", "Modality of a track (major = 1, minor = 0)",
"Presence of spoken words in a track (range from 0 to 1)",
"Confidence measure from 0 to 1 of whether the track is acoustic",
"Predicts whether a track contains vocals or not (1 indicates no vocals, 0 is high vocals)",
"Detects the presence of an audience in the recording (values above .8 are most likely live tracks)",
"Describes positivity of a song (1 is cheerful, 0 is sad or angry)",
"Tempo of a track in beats per minute (BPM)", "Duration of song in milliseconds")
Variable.name1 <- colnames(spotify)
data.desc <- as_tibble(cbind(Variable.name1, Variable.type, Variable.desc))
colnames(data.desc) <- c("Variable Name", "Data Type", "Variable Description")
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| track_id | character | Unique ID assigned to each song |
| track_name | character | Song name |
| track_artist | character | Song artist |
| track_popularity | numeric | Song popularity (0-100) where higher is better |
| track_album_id | character | Unique ID assigned to each album |
| track_album_name | character | Album name that song is on |
| track_album_release_date | character | Date when album was released |
| playlist_name | character | Name of playlist that has song on it |
| playlist_id | character | Unique ID of playlist |
| playlist_genre | character | Genre of the playlist |
| playlist_subgenre | character | Subgenre of the playlist |
| danceability | numeric | How suitable a track is for dancing (0 is least danceable and 1 is most danceable) |
| energy | numeric | Energy represents the measure of intensity and activity (range from 0 to 1) |
| key | numeric | Overall key of the track (0 = C, 1 = C#, 2 = D, etc. -1 = no key detected) |
| loudness | numeric | Loudness of a track in decibels (dB) |
| mode | numeric | Modality of a track (major = 1, minor = 0) |
| speechiness | numeric | Presence of spoken words in a track (range from 0 to 1) |
| acousticness | numeric | Confidence measure from 0 to 1 of whether the track is acoustic |
| instrumentalness | numeric | Predicts whether a track contains vocals or not (1 indicates no vocals, 0 is high vocals) |
| liveness | numeric | Detects the presence of an audience in the recording (values above .8 are most likely live tracks) |
| valence | numeric | Describes positivity of a song (1 is cheerful, 0 is sad or angry) |
| tempo | numeric | Tempo of a track in beats per minute (BPM) |
| duration_ms | numeric | Duration of song in milliseconds |
The first part of data cleaning is removing outliers, which we found in song duration. There is one song that is 4 seconds long, and multiple songs around 30 seconds. We are going to remove the 4 second song, but keep the 30 second songs: many of these are interludes, and they could hint at whether an entire album is popular, since people who like an album tend to listen all the way through, including the interludes. The longest song is 8 minutes and 37 seconds, which we will keep.
The second part of data cleaning is removing any abnormalities. We are going to remove a few songs whose loudness goes above 0 dB, as this is considered abnormal.
The third part of data cleaning is assessing any missing values. We have a total of 15 missing values spread across 5 rows. Since we aren’t sure what these songs are and since the popularity is 0 for all 5 rows, we decided to delete them.
spotify <- spotify[!(spotify$duration_ms < 5000), ] # removing the 4 second song
spotify <- spotify[!(spotify$loudness > 0), ] # removing songs over 0 dB (considered too loud)
colSums(is.na(spotify)) %>%
pander()
| track_id | track_name | track_artist | track_popularity | track_album_id |
|---|---|---|---|---|
| 0 | 5 | 5 | 0 | 0 |
| track_album_name | track_album_release_date | playlist_name | playlist_id |
|---|---|---|---|
| 5 | 0 | 0 | 0 |
| playlist_genre | playlist_subgenre | danceability | energy | key | loudness |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| mode | speechiness | acousticness | instrumentalness | liveness | valence |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| tempo | duration_ms |
|---|---|
| 0 | 0 |
spotify <- spotify[!is.na(spotify$track_name), ] # remove NA values
There are also 4477 duplicates of songs (using the track_id variable) due to the same song being on multiple playlists. After examining the data, the playlist was the only difference between these duplicates (all other values remained the same), so we determined that it was okay to remove these duplicates.
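The check described above (duplicates differ only in their playlist columns) can be sketched in base R as follows. The toy data frame is made up for illustration and is much smaller than the real data; the idea is to drop the playlist columns and confirm that each track_id is left with exactly one distinct row.

```r
# Toy stand-in for the real data (values are invented for illustration)
songs <- data.frame(
  track_id = c("a", "a", "b"),
  track_popularity = c(80, 80, 55),
  playlist_id = c("p1", "p2", "p1"),
  playlist_genre = c("pop", "rap", "pop"),
  stringsAsFactors = FALSE
)

# Keep only the columns that describe the song itself
is_playlist_col <- grepl("^playlist_", names(songs))
song_cols <- songs[, !is_playlist_col, drop = FALSE]

# Distinct song-level rows; a track_id appearing more than once here
# would mean its duplicates disagree on something other than the playlist
distinct_rows <- unique(song_cols)
versions_per_track <- table(distinct_rows$track_id)
all(versions_per_track == 1)
```

If the final line returns TRUE, dropping the extra rows loses no song-level information.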
However, before we removed the duplicates, we did some manipulation of the data so we could keep the information from those observations. We gathered the data to see which track_id’s appeared on the most playlists and which playlists they appeared on. We then counted the number of duplicates, and counted the frequency of each number of duplicates, as this can give us a hint at what makes a song popular.
We also created a new identification variable, so that we could have an observation for every playlist_genre the song appeared on, and a count of repetitions for that genre.
spotify_dup_count <- spotify %>%
count(track_id, sort=TRUE) %>%
rename(dup_count = n) #creating a tibble to store track_id's and number of duplicates
spotify_freq_count <- spotify_dup_count %>%
count(dup_count,sort=TRUE) %>%
rename(freq_count = n) #creating a tibble to store number of duplicates and their frequencies
# Adding counts of song duplicates and duplicate frequencies to full data set
spotify_dup <- spotify %>%
full_join(spotify_dup_count, by = "track_id") %>%
full_join(spotify_freq_count, by = "dup_count")
# Creating another identification variable (track_genre_id), counting duplicates of that, and adding to full data set
spotify_id_genre <- spotify_dup %>%
mutate(track_genre_id = paste(track_id,playlist_genre))
id_genre_count <- spotify_id_genre %>%
count(track_genre_id) %>%
rename(id_genre_dup = n)
spotify_full <- spotify_id_genre %>%
full_join(id_genre_count, by = "track_genre_id")
# Proving that it worked
spotify_full %>%
select(track_name, playlist_genre,dup_count, id_genre_dup)%>%
filter(dup_count == 10) %>%
pander()
| track_name | playlist_genre | dup_count | id_genre_dup |
|---|---|---|---|
| Closer (feat. Halsey) | pop | 10 | 4 |
| Closer (feat. Halsey) | pop | 10 | 4 |
| Closer (feat. Halsey) | pop | 10 | 4 |
| Closer (feat. Halsey) | pop | 10 | 4 |
| Closer (feat. Halsey) | rap | 10 | 1 |
| Closer (feat. Halsey) | latin | 10 | 3 |
| Closer (feat. Halsey) | latin | 10 | 3 |
| Closer (feat. Halsey) | latin | 10 | 3 |
| Closer (feat. Halsey) | r&b | 10 | 1 |
| Closer (feat. Halsey) | edm | 10 | 1 |
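For readers who want to see the two counts in isolation, here is a small base R sketch on made-up data. It uses ave() rather than the count-and-join approach used in this report, so it is an alternative illustration and not the code that produced the table above.

```r
# Toy data: track "a" is on three playlists (two pop, one rap), track "b" on one
songs <- data.frame(
  track_id = c("a", "a", "a", "b"),
  playlist_genre = c("pop", "pop", "rap", "rock"),
  stringsAsFactors = FALSE
)

row_index <- seq_len(nrow(songs))

# dup_count: how many rows share this track_id
songs$dup_count <- ave(row_index, songs$track_id, FUN = length)

# id_genre_dup: how many rows share this track_id AND this playlist_genre
songs$id_genre_dup <- ave(row_index, songs$track_id, songs$playlist_genre,
                          FUN = length)
songs
```

Each row keeps its own playlist information while also carrying the per-song and per-song-per-genre counts, which is the same shape as spotify_full.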
Doing this will give us three datasets: one dedicated to the entire dataset and the new variables, one where there is a single observation of each song, and one where there is a single observation of a song for each playlist genre it appeared in. We wanted the spotify dataset to consist of only single observations, so that when analysis is done, the overall output is not skewed by multiple entries of the same song. We also wanted the spotify_id_genre dataset to consist of only a single observation of a song for each playlist genre, so that when additional analysis is done, the output is not skewed by observations of playlist_genre being removed. spotify_full is our full dataset with the addition of the new variables.
# Removing the desired duplicates from spotify and spotify_id_genre
spotify_id_genre <- spotify_full[!duplicated(spotify_full$track_genre_id),]
spotify <- spotify[!duplicated(spotify$track_id), ]
After cleaning the data, we are left with 32821 rows and 27 columns in spotify_full, our complete dataset. In spotify, we are left with 28345 rows and 23 columns. In spotify_id_genre, we are left with 30373 rows and 27 columns.
datatable(head(spotify,10))
datatable(head(spotify_id_genre,10))
Preview of a single song and the genre of playlists it appears in from spotify_id_genre.
spotify_id_genre %>%
select(track_name, playlist_genre, dup_count, id_genre_dup)%>%
filter(dup_count == 10)%>%
pander()
| track_name | playlist_genre | dup_count | id_genre_dup |
|---|---|---|---|
| Closer (feat. Halsey) | pop | 10 | 4 |
| Closer (feat. Halsey) | rap | 10 | 1 |
| Closer (feat. Halsey) | latin | 10 | 3 |
| Closer (feat. Halsey) | r&b | 10 | 1 |
| Closer (feat. Halsey) | edm | 10 | 1 |
The songs on the Spotify dataset are pretty evenly divided among genres which will make this analysis a lot easier to work with. We can see that rap has the most songs in the dataset.
spotify %>%
count(playlist_genre) %>%
kable()
| playlist_genre | n |
|---|---|
| edm | 4875 |
| latin | 4136 |
| pop | 5132 |
| r&b | 4504 |
| rap | 5395 |
| rock | 4303 |
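To put a number on "pretty evenly divided", the counts in the table above can be turned into percentage shares. This is a small sketch that retypes the counts by hand instead of recomputing them from the data.

```r
# Genre counts copied from the table above
genre_counts <- c(edm = 4875, latin = 4136, pop = 5132,
                  "r&b" = 4504, rap = 5395, rock = 4303)

# Share of all songs in each genre, in percent
genre_share <- round(100 * genre_counts / sum(genre_counts), 1)
genre_share

# Genres with the largest and smallest share
names(which.max(genre_counts))
names(which.min(genre_counts))
```

A perfectly even split across six genres would give each about 16.7%, so this shows how far the largest and smallest genres sit from that.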
We broke up our analysis into two parts. The first part investigates the spotify dataset to see what variables have a greater impact on a song’s popularity. The second part is a continuation of the first, where we investigate the spotify_id_genre data to see if those impactful variables have an effect on the number of playlists a song is on.
This dataset was broken down into 6 different genres: edm, latin, pop, r&b, rap, and rock. We decided to divide the data into these different genres and see what variables may make a song in a particular genre more popular.
First, we decided to see if there was any correlation between variables and track popularity based on genre:
genre_cor <- spotify %>%
split(.$playlist_genre) %>%
map(~{
cor(.x[12:23], .x$track_popularity)
})
genre_cor %>%
imap(~{
.x <- .x %>%
as_tibble(rownames = 'measures')
colnames(.x)[2] <- .y
.x
}) %>%
reduce(inner_join, by = 'measures') %>%
kable()
| measures | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| danceability | 0.0075376 | 0.0269757 | 0.0905810 | -0.0499501 | 0.1350561 | 0.0630089 |
| energy | -0.0668703 | -0.0949504 | -0.0550890 | -0.1154659 | -0.1203482 | -0.0497579 |
| key | 0.0209148 | -0.0136671 | 0.0030875 | -0.0346284 | -0.0092971 | -0.0159771 |
| loudness | 0.0271987 | 0.1313889 | 0.1147850 | 0.0798584 | -0.0374191 | 0.0188584 |
| mode | 0.0007403 | 0.0368875 | 0.0026690 | 0.0415122 | -0.0312459 | -0.0016235 |
| speechiness | 0.0313315 | 0.0275124 | 0.0905473 | -0.0199228 | -0.0788789 | -0.0011890 |
| acousticness | 0.1564317 | 0.1233711 | 0.0289534 | 0.0970152 | 0.0754517 | 0.0046411 |
| instrumentalness | -0.1560880 | -0.1166890 | -0.1409988 | -0.0620429 | 0.0293415 | -0.0983897 |
| liveness | 0.0165836 | -0.0454335 | -0.0003136 | -0.0608468 | -0.0603816 | -0.1020059 |
| valence | 0.0896623 | 0.0061285 | 0.0207726 | -0.1292518 | -0.0275776 | -0.0097332 |
| tempo | -0.0211804 | 0.0349201 | -0.0353436 | 0.0235761 | 0.0493715 | -0.0028275 |
| duration_ms | -0.2342470 | -0.0748685 | -0.1460902 | -0.1421890 | -0.1373292 | -0.0311505 |
As you can see from the results, the correlations between the measurable variables and track popularity are all weak: the strongest is duration_ms for edm at about -0.23, and most are within 0.1 of zero.
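One way to read the size of these correlations is to square them, which gives the share of variance in popularity that a single variable would account for in a simple linear fit. A quick sketch using the largest value in the table (duration_ms for edm):

```r
# Strongest correlation in the table: duration_ms vs. popularity for edm
r <- -0.2342470

# Squared correlation = proportion of variance explained in a one-variable linear fit
r_squared <- r^2
round(r_squared, 3)   # 0.055
```

Even the strongest relationship in the table accounts for only about 5.5% of the variation in popularity on its own.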
We did not give up here though. We decided to run a regression analysis to determine how each variable impacts the track popularity:
panderOptions('knitr.auto.asis', FALSE)
genre_lm <- spotify %>%
split(.$playlist_genre) %>%
map(~{
lm(track_popularity ~ ., data = .x[c(4, 12:23)])
})
genre_lm %>%
iwalk(~{
cat(.y, '\n')
pander(.x)
})
edm
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 43.11 | 4.432 | 9.725 | 3.739e-22 |
| danceability | 5.465 | 2.659 | 2.056 | 0.03988 |
| energy | -6.782 | 2.996 | -2.264 | 0.02364 |
| key | 0.1818 | 0.07972 | 2.28 | 0.02265 |
| loudness | 0.08157 | 0.1664 | 0.4903 | 0.6239 |
| mode | 0.5928 | 0.5706 | 1.039 | 0.2988 |
| speechiness | -2.64 | 4.035 | -0.6541 | 0.5131 |
| acousticness | 14.68 | 2.187 | 6.713 | 2.128e-11 |
| instrumentalness | -4.652 | 0.9684 | -4.804 | 1.601e-06 |
| liveness | 0.7648 | 1.631 | 0.4689 | 0.6391 |
| valence | 2.747 | 1.377 | 1.994 | 0.04616 |
| tempo | 0.003207 | 0.01991 | 0.1611 | 0.8721 |
| duration_ms | -5.656e-05 | 4.193e-06 | -13.49 | 9.828e-41 |
latin
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 72.16 | 4.777 | 15.11 | 3.12e-50 |
| danceability | 3.987 | 3.474 | 1.148 | 0.2511 |
| energy | -30.39 | 3.259 | -9.324 | 1.778e-20 |
| key | -0.05505 | 0.0982 | -0.5606 | 0.5751 |
| loudness | 1.93 | 0.1556 | 12.41 | 9.784e-35 |
| mode | 1.096 | 0.7177 | 1.527 | 0.1268 |
| speechiness | -0.3757 | 4.134 | -0.09089 | 0.9276 |
| acousticness | 9.304 | 1.842 | 5.05 | 4.604e-07 |
| instrumentalness | -7.475 | 2.035 | -3.673 | 0.0002425 |
| liveness | -2.971 | 2.358 | -1.26 | 0.2077 |
| valence | 0.3141 | 1.789 | 0.1756 | 0.8606 |
| tempo | 0.03298 | 0.01277 | 2.583 | 0.009838 |
| duration_ms | -2.242e-05 | 7.123e-06 | -3.148 | 0.001657 |
pop
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 79.42 | 4.404 | 18.03 | 1.579e-70 |
| danceability | 12.42 | 2.927 | 4.243 | 2.247e-05 |
| energy | -28.28 | 3.166 | -8.931 | 5.808e-19 |
| key | 0.08428 | 0.09225 | 0.9137 | 0.3609 |
| loudness | 2.013 | 0.1828 | 11.01 | 6.813e-28 |
| mode | 0.2896 | 0.6859 | 0.4223 | 0.6728 |
| speechiness | 23.14 | 4.993 | 4.635 | 3.661e-06 |
| acousticness | 0.1527 | 1.896 | 0.0805 | 0.9358 |
| instrumentalness | -8.585 | 1.897 | -4.525 | 6.186e-06 |
| liveness | 1.082 | 2.477 | 0.437 | 0.6621 |
| valence | -0.3988 | 1.741 | -0.229 | 0.8189 |
| tempo | -0.01279 | 0.01398 | -0.9145 | 0.3605 |
| duration_ms | -4.144e-05 | 7.656e-06 | -5.412 | 6.506e-08 |
r&b
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 74.48 | 3.972 | 18.75 | 1.372e-75 |
| danceability | 0.1796 | 2.922 | 0.06147 | 0.951 |
| energy | -21.84 | 2.936 | -7.438 | 1.218e-13 |
| key | -0.1042 | 0.09635 | -1.081 | 0.2795 |
| loudness | 1.486 | 0.1519 | 9.786 | 2.169e-22 |
| mode | 1.27 | 0.6973 | 1.822 | 0.06856 |
| speechiness | -5.692 | 3.316 | -1.716 | 0.08615 |
| acousticness | 2.909 | 1.641 | 1.773 | 0.07632 |
| instrumentalness | -9.176 | 2.876 | -3.191 | 0.001427 |
| liveness | -7.086 | 2.443 | -2.9 | 0.003745 |
| valence | -7.046 | 1.848 | -3.813 | 0.0001389 |
| tempo | 0.01905 | 0.01211 | 1.573 | 0.1159 |
| duration_ms | -4.555e-05 | 6.026e-06 | -7.558 | 4.932e-14 |
rap
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 46.61 | 3.929 | 11.86 | 4.591e-32 |
| danceability | 21.97 | 2.47 | 8.893 | 7.963e-19 |
| energy | -15.03 | 2.847 | -5.28 | 1.34e-07 |
| key | -0.04076 | 0.08405 | -0.485 | 0.6277 |
| loudness | 0.5248 | 0.1481 | 3.545 | 0.0003963 |
| mode | -1.258 | 0.6246 | -2.014 | 0.04409 |
| speechiness | -11.45 | 2.357 | -4.857 | 1.225e-06 |
| acousticness | 4.171 | 1.59 | 2.624 | 0.008725 |
| instrumentalness | -2.799 | 1.565 | -1.789 | 0.07373 |
| liveness | -0.659 | 2.101 | -0.3136 | 0.7538 |
| valence | -2.282 | 1.472 | -1.55 | 0.1212 |
| tempo | 0.04971 | 0.009637 | 5.158 | 2.579e-07 |
| duration_ms | -4.297e-05 | 5.407e-06 | -7.948 | 2.296e-15 |
rock
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 55.16 | 4.545 | 12.14 | 2.396e-33 |
| danceability | 11.71 | 3.301 | 3.547 | 0.0003936 |
| energy | -16.3 | 3.517 | -4.634 | 3.694e-06 |
| key | -0.1009 | 0.1045 | -0.9648 | 0.3347 |
| loudness | 0.6925 | 0.1737 | 3.987 | 6.794e-05 |
| mode | -0.2162 | 0.8031 | -0.2692 | 0.7878 |
| speechiness | 9.512 | 8.724 | 1.09 | 0.2757 |
| acousticness | -4.155 | 2.173 | -1.912 | 0.05597 |
| instrumentalness | -11.19 | 2.048 | -5.465 | 4.903e-08 |
| liveness | -11.02 | 2.169 | -5.083 | 3.864e-07 |
| valence | -3.273 | 2.031 | -1.612 | 0.1071 |
| tempo | 0.02017 | 0.01383 | 1.458 | 0.1449 |
| duration_ms | -5.573e-06 | 5.797e-06 | -0.9613 | 0.3365 |
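When reading the estimates in these tables, keep in mind that each one is expressed per unit of its own variable, so a tiny number does not necessarily mean a tiny effect. As a worked example, the edm estimate for duration_ms is per millisecond; rescaling it to minutes makes it easier to compare with the other rows.

```r
# edm estimate for duration_ms, copied from the regression table above
coef_per_ms <- -5.656e-05

# Convert "per millisecond" into "per minute"
ms_per_minute <- 60 * 1000
coef_per_minute <- coef_per_ms * ms_per_minute
round(coef_per_minute, 2)   # -3.39
```

So, under this model, each extra minute of an edm track is associated with roughly 3.4 fewer popularity points, holding the other variables fixed.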
After focusing on the estimate values, we discovered four variables that impact track popularity the most: danceability, energy, speechiness, and instrumentalness.
Let’s take a closer look at these variables, and see how much they change the track popularity based on genre. Keep in mind that the Standard Error values are quite high for some of the variables.
dance_graph <- genre_lm %>%
map(tidy) %>%
imap(~.x %>%
mutate(genre = .y)) %>%
bind_rows() %>%
filter(term == 'danceability') %>%
ggplot(aes(x = reorder(genre, estimate), estimate)) +
geom_col(fill = 'royalblue2') +
geom_text(aes(label = round(estimate, digits = 1),
vjust = -0.3)) +
labs(title = 'Impact of Danceability on Track Popularity',
x = 'Genre', y = 'Estimate') +
theme(plot.title = element_text(size = rel(1.4), face = "bold",
color = "royalblue2"),
panel.background = element_rect(fill = "white"))
dance_graph
According to this graph, if we increase the danceability of a rap song by 1, the track’s popularity will grow by about 22 points (the estimate in the regression table is 21.97). However, danceability ranges from 0 to 1, so a full 1-point increase is only possible in the extreme case of going from 0 to 1. Regardless, it does seem that if you want to slightly increase the popularity of your rap song, you might want to increase the danceability a bit.
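Because danceability only ranges from 0 to 1, a more realistic reading is the expected change for a small increase. A short sketch using the rap estimate of 21.97 reported in the regression table; the step of 0.1 is just an illustrative choice.

```r
# Rap estimate for danceability, copied from the regression table
rap_danceability_coef <- 21.97

# Illustrative (assumed) increase in danceability
delta <- 0.1

# Expected change in popularity, holding the other variables fixed
expected_change <- rap_danceability_coef * delta
round(expected_change, 1)   # 2.2
```

Under this model, raising a rap track's danceability by 0.1 is associated with about 2 extra popularity points.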
energy_graph <- genre_lm %>%
map(tidy) %>%
imap(~.x %>%
mutate(genre = .y)) %>%
bind_rows() %>%
filter(term == 'energy') %>%
ggplot(aes(x = reorder(genre, estimate), estimate)) +
geom_col(fill = 'springgreen3') +
geom_text(aes(label = round(estimate, digits = 1), vjust = 1.2)) +
labs(title = 'Impact of Energy on Track Popularity',
x = 'Genre', y = 'Estimate') +
theme(plot.title = element_text(size = rel(1.4), face = "bold",
color = "springgreen3"),
panel.background = element_rect(fill = "white"))
energy_graph
Energy can impact a track’s popularity by a lot, especially if the track is in the Latin genre. We were pretty surprised to see these results, as raising the energy in Latin and pop songs may lead to a loss of popularity. We typically think of Latin and pop music as energetic, but the genres do have a large range of tracks, and the standard error for this regression was quite high.
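To see what the standard error implies here, an approximate 95% confidence interval can be built from the Latin energy estimate and its standard error in the regression table. This sketch uses a normal approximation, which is an assumption on our part, though a reasonable one with several thousand songs per genre.

```r
# Latin estimate and standard error for energy, from the regression table
estimate <- -30.39
std_error <- 3.259

# Approximate 95% interval: estimate +/- z * SE (normal approximation)
z <- qnorm(0.975)
ci <- estimate + c(-1, 1) * z * std_error
round(ci, 1)   # -36.8 -24.0
```

The interval is several points wide, but it stays well below zero, so the direction of the energy effect for Latin tracks is not in doubt under this model.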
speechiness_graph <- genre_lm %>%
map(tidy) %>%
imap(~.x %>%
mutate(genre = .y)) %>%
bind_rows() %>%
filter(term == 'speechiness') %>%
ggplot(aes(x = reorder(genre, estimate), estimate)) +
geom_col(fill = 'orangered3') +
geom_text(aes(label = round(estimate, digits = 1), vjust = 1.2)) +
labs(title = 'Impact of Speechiness on Track Popularity',
x = 'Genre', y = 'Estimate') +
theme(plot.title = element_text(size = rel(1.4), face = "bold",
color = "orangered3"),
panel.background = element_rect(fill = "white"))
speechiness_graph
Looking at this graph, speechiness (the presence of spoken words) can impact a track’s popularity positively or negatively depending on the genre. Speechiness can help a pop song but hurt a rap song, which is the opposite of what we expected, as rap songs typically have a high speechiness.
instrumentalness_graph <- genre_lm %>%
map(tidy) %>%
imap(~.x %>%
mutate(genre = .y)) %>%
bind_rows() %>%
filter(term == 'instrumentalness') %>%
ggplot(aes(x = reorder(genre, estimate), estimate)) +
geom_col(fill = 'purple3') +
geom_text(aes(label = round(estimate, digits = 1), vjust = 1.2)) +
labs(title = 'Impact of Instrumentalness on Track Popularity',
x = 'Genre', y = 'Estimate') +
theme(plot.title = element_text(size = rel(1.4), face = "bold",
color = "purple3"),
panel.background = element_rect(fill = "white"))
instrumentalness_graph
It’s clear that increasing the instrumentalness will negatively impact a track’s popularity, regardless of the genre. We were most surprised to see the rock genre leading on this graph, since we typically associate rock with a lot of guitar solos.
spotify_id_genre Analysis
Just like the spotify dataset, the spotify_id_genre and spotify_full datasets can still be broken down into the 6 different playlist genres, but they can also be broken down by the number of times a song is repeated. This allows us to see the distribution between genres and the repetition of songs. We used spotify_full to generate this, so that we had the most accurate counts for the dataset.
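# Count rows for each combination of duplicate count and playlist genre
# (named dup_genre_counts so it does not shadow the name of base R's subset())
dup_genre_counts <- spotify_full %>%
select(dup_count,playlist_genre)%>%
group_by(dup_count)%>%
count(playlist_genre)%>%
rename(genre_count = n)
# One row per dup_count, one column per genre
count_table <- dup_genre_counts %>%
spread(playlist_genre,genre_count)
kable(count_table)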
| dup_count | edm | latin | pop | r&b | rap | rock |
|---|---|---|---|---|---|---|
| 1 | 4511 | 3729 | 3988 | 4270 | 4822 | 3860 |
| 2 | 1013 | 770 | 796 | 729 | 624 | 834 |
| 3 | 296 | 308 | 352 | 192 | 171 | 211 |
| 4 | 102 | 140 | 157 | 79 | 59 | 31 |
| 5 | 45 | 79 | 86 | 61 | 18 | 11 |
| 6 | 36 | 51 | 61 | 39 | 22 | 1 |
| 7 | 17 | 38 | 29 | 27 | 8 | NA |
| 8 | 18 | 30 | 29 | 30 | 12 | 1 |
| 9 | 2 | 5 | 5 | 3 | 3 | NA |
| 10 | 1 | 3 | 4 | 1 | 1 | NA |
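The NA cells in the table above are combinations of duplicate count and genre that have no rows at all. Below is a hedged base R sketch, on made-up toy data, of a cross-tabulation that reports such combinations as 0 instead. It uses xtabs(), which is a different technique from the one that produced the table above.

```r
# Toy data: no rock song has a dup_count of 3
toy <- data.frame(
  dup_count = c(1, 1, 2, 2, 3),
  playlist_genre = c("pop", "rock", "pop", "rock", "pop"),
  stringsAsFactors = FALSE
)

# Cross-tabulate; combinations that never occur are counted as 0
count_table <- xtabs(~ dup_count + playlist_genre, data = toy)
count_table
```

Reporting zeros avoids any confusion between "no songs" and "value missing".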
Because of the sheer numbers, we decided to continue the rest of our analysis by focusing on the duplicate counts of 3, 4, and 5. This is for a few reasons:
Our analysis of the spotify dataset indicated 4 main variables that are the most impactful to the popularity of a track: danceability, energy, speechiness, and instrumentalness.
Our analysis of the spotify_id_genre dataset will take a closer look at these variables:
# Track_popularity vs. Danceability
ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=danceability,y=track_popularity, color=playlist_genre))+
geom_point()+
geom_smooth()+
coord_cartesian(xlim=c(0.0,1.0),ylim= c(0,100))+
labs(title = 'Danceability vs. Track Popularity', subtitle = 'Faceted by Genre & 3,4, & 5 Duplicates', x = 'Danceability', y = 'Track Popularity', color = "Playlist Genre")+ # legend title must match the color aesthetic
theme(plot.title = element_text(size=rel(1.2), face = "bold"))+
facet_grid(playlist_genre~dup_count)
For songs that appeared on 3, 4, and 5 different playlists, danceability ranged between 0.25 and 0.90. The trend line for track_popularity versus danceability stays in the higher ranges of both axes, suggesting that songs with high track popularity and high danceability are likely to appear on multiple playlists.
#Energy
ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=energy,y=track_popularity, color=playlist_genre))+
geom_point()+
geom_smooth()+
coord_cartesian(xlim=c(0.0,1.0),ylim= c(0,100))+
labs(title = 'Energy vs. Track Popularity', subtitle = 'Faceted by Genre & 3,4, & 5 Duplicates', x = 'Energy', y = 'Track Popularity', color = "Playlist Genre")+ # legend title must match the color aesthetic
theme(plot.title = element_text(size=rel(1.2), face = "bold"))+
facet_grid(playlist_genre~dup_count)
For songs that appeared on 3, 4, and 5 different playlists, the energy ranges depended more on the genre. However, the trend line for track_popularity versus energy is consistently in the higher ranges of both axes. This differs from our spotify analysis, where an increase in energy was associated with a decrease in popularity. In this part of the analysis, high popularity combined with high energy would be a good indicator of a track appearing on multiple playlists.
#Speechiness
ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=speechiness,y=track_popularity, color=playlist_genre))+
geom_point()+
geom_smooth()+
coord_cartesian(xlim=c(0.0,1.0),ylim= c(0,100))+
labs(title = 'Speechiness vs. Track Popularity', subtitle = 'Faceted by Genre & 3,4, & 5 Duplicates', x = 'Speechiness', y = 'Track Popularity', color = "Playlist Genre")+ # legend title must match the color aesthetic
theme(plot.title = element_text(size=rel(1.2), face = "bold"))+
facet_grid(playlist_genre~dup_count)
For speechiness, values ranged from 0.0 to around 0.50, with the highest density of points between 0.0 and 0.125. These values indicate that songs that are mostly music with some speech, and that have high popularity, are the most likely to appear on multiple playlists.
#Instrumentalness
ggplot(subset(spotify_id_genre, dup_count %in% c(3:5)), aes(x=instrumentalness,y=track_popularity, color=playlist_genre))+
geom_point()+
geom_smooth()+
coord_cartesian(xlim=c(0.0,1.0),ylim= c(-400,400))+
labs(title = 'Instrumentalness vs. Track Popularity', subtitle = 'Faceted by Genre & 3,4, & 5 Duplicates', x = 'Instrumentalness', y = 'Track Popularity', color = "Playlist Genre")+ # legend title must match the color aesthetic
theme(plot.title = element_text(size=rel(1.2), face = "bold"))+
facet_grid(playlist_genre~dup_count)
For instrumentalness, the trend lines appear to be all over the place, so it is difficult to tell how high popularity and instrumentalness impact the likelihood of a track being on multiple playlists.
One interesting thing to note is the ranges for instrumentalness by genre are pretty consistent for the number of playlists a song appears on.
Before digging deep into this dataset, we decided we wanted to see what might impact a song’s popularity. Analyzing this data helped us understand a few factors that may help make a song popular depending on the genre. We also discovered that rap songs appear the most in this dataset, which is one thing we wanted to find.
Despite not finding any strong correlation between the measurable variables and track popularity within each genre, we did find useful information during our regression analysis:
What helps make a song popular? spotify analysis:
- danceability may increase the popularity of rap, pop, and rock tracks
- energy may decrease a track’s popularity in all genres, while impacting latin, pop, and r&b songs the most
- speechiness may decrease the popularity of a rap song, but increase the popularity of a pop song
- instrumentalness may decrease a track’s popularity, especially for a rock song

What helps make a song appear on multiple playlists? spotify_id_genre analysis:
- track_popularity
- danceability
- energy
- speechiness

While these are good indicators, they are not perfect, because there are many songs in the dataset that have all of these characteristics and only appear on one playlist.
The analysis of this data set can help artists create songs that may become popular and appear on multiple playlists. This can help the artist and affiliates increase their revenue.
One limitation of our analysis was that we could not compare a track’s genre to the genre of the playlists that it appeared on. We think that this could be useful for our consumers because, through analysis of the numerical variables, it could indicate which qualities of a song carry across the most genres. This could be done by adding a variable, track_genre, and by running analysis similar to what we have done.
If we could redo this analysis, we would probably look into using sub-genres instead of genres, as there were only 6 genres, and they were pretty broad. Doing so may show more insights and more accurate results of what makes a song popular.