I will try to understand what makes some songs on Spotify popular and why others are not. I think it is interesting to compare qualitative and quantitative data about each song, such as its energy, speechiness, and genre, and to see whether certain ranges of these values could help explain why a song is popular.
To conduct my analysis, I will compare the most popular songs, according to their popularity rating, with the least popular ones. This comparison will measure some of the qualitative and quantitative traits of these two groups and check whether there is a notable difference between them.
By carrying out this analysis for Spotify, I hope to identify popular songs by the pattern of characteristics they share, which could improve song recommendations for users. More popular songs generate more playbacks, and adding more songs that fit the characteristics of a “popular song” can bring in more users. This does not mean we should exclude all the unpopular songs in favor of the popular ones, but the analysis could give us some insight into what kinds of songs users like, and we could use that information to bring in more.
library(ggplot2) # for visualizing the data
library(dplyr) # for data manipulation
library(tibble) # to create tibble data
library(tidyr) # to create variable data from music genre
library(magrittr) # for simplifying code with pipes
library(DT) # for displaying interactive data tables
Before we can understand what makes a song on Spotify popular, we must clean our data set.
raw_data <- read.csv("spotify_songs.csv")
str(raw_data)
The data set has 23 variables, and missing values are stored as NA.
First, I will find how many missing values there are and remove them to create a data set with no missing values. Then I will remove duplicated songs, and I will also remove any remixes because they would skew the results. Remixes tend to be hit or miss, and a poorly rated remix could pull down an artist's average popularity.
colSums(is.na(raw_data))
# counts the NA values in each column
complete_data <- na.omit(raw_data)
# creates a new data frame containing only the rows with no NA values
colSums(is.na(complete_data))
# counts the NA values in the new data frame
unique_data <- distinct(complete_data, track_name, .keep_all = TRUE)
# keeps only the first observation for each song title
data_without_remix <- unique_data[!grepl("remix", unique_data$track_name, ignore.case = TRUE), ]
# removes the remix songs, matching "remix" in any capitalization
clean_data <- select(data_without_remix, -c(track_album_id, playlist_id, track_id))
# removes the unique identifier columns for each track, album, and playlist
datatable(head(clean_data, 10), caption = 'First 10 Observations')
# displays the first 10 rows of the cleaned data
# danceability
clean_data %>%
  summarise(average = mean(danceability), max = max(danceability), min = min(danceability),
            first_quartile = quantile(danceability, 0.25), median = median(danceability),
            third_quartile = quantile(danceability, 0.75), sd = sd(danceability))
# loudness
clean_data %>%
  summarise(average = mean(loudness), max = max(loudness), min = min(loudness),
            first_quartile = quantile(loudness, 0.25), median = median(loudness),
            third_quartile = quantile(loudness, 0.75), sd = sd(loudness))
# energy
clean_data %>%
  summarise(average = mean(energy), max = max(energy), min = min(energy),
            first_quartile = quantile(energy, 0.25), median = median(energy),
            third_quartile = quantile(energy, 0.75), sd = sd(energy))
# speechiness
clean_data %>%
  summarise(average = mean(speechiness), max = max(speechiness), min = min(speechiness),
            first_quartile = quantile(speechiness, 0.25), median = median(speechiness),
            third_quartile = quantile(speechiness, 0.75), sd = sd(speechiness))
# acousticness
clean_data %>%
  summarise(average = mean(acousticness), max = max(acousticness), min = min(acousticness),
            first_quartile = quantile(acousticness, 0.25), median = median(acousticness),
            third_quartile = quantile(acousticness, 0.75), sd = sd(acousticness))
# instrumentalness
clean_data %>%
  summarise(average = mean(instrumentalness), max = max(instrumentalness), min = min(instrumentalness),
            first_quartile = quantile(instrumentalness, 0.25), median = median(instrumentalness),
            third_quartile = quantile(instrumentalness, 0.75), sd = sd(instrumentalness))
# valence
clean_data %>%
  summarise(average = mean(valence), max = max(valence), min = min(valence),
            first_quartile = quantile(valence, 0.25), median = median(valence),
            third_quartile = quantile(valence, 0.75), sd = sd(valence))
# tempo
clean_data %>%
  summarise(average = mean(tempo), max = max(tempo), min = min(tempo),
            first_quartile = quantile(tempo, 0.25), median = median(tempo),
            third_quartile = quantile(tempo, 0.75), sd = sd(tempo))
# duration
clean_data %>%
  summarise(average = mean(duration_ms), max = max(duration_ms), min = min(duration_ms),
            first_quartile = quantile(duration_ms, 0.25), median = median(duration_ms),
            third_quartile = quantile(duration_ms, 0.75), sd = sd(duration_ms))
# grouping data by genre
by_genre <- group_by(clean_data, playlist_genre)
# finding averages of a few measurable traits by genre
summarise(by_genre, average_danceability = mean(danceability), average_energy = mean(energy), average_loudness = mean(loudness))
## # A tibble: 6 x 4
## playlist_genre average_danceability average_energy average_loudness
## <chr> <dbl> <dbl> <dbl>
## 1 edm 0.664 0.809 -5.61
## 2 latin 0.708 0.706 -6.61
## 3 pop 0.634 0.697 -6.48
## 4 r&b 0.664 0.587 -8.04
## 5 rap 0.717 0.647 -7.17
## 6 rock 0.519 0.729 -7.61
I could break up the data based on popularity. For example, I could filter the data into two data frames that contain the top 25% and the bottom 25% of songs by popularity rating. I could then compare the qualitative and quantitative metrics associated with these songs and see whether there is a discrepancy between the two groups. I also plan on breaking down the data by genre to see which genres are the most popular, and on comparing their metrics with those of the least popular genres. I could do the same for artists: isolate the most popular artists (by taking the mean popularity of all their songs) and the least popular ones, then compare their song metrics.
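As a rough sketch of the top 25% versus bottom 25% split, the code below labels each song by popularity quartile and compares one metric across the two extreme groups. It assumes the popularity score is stored in a column named `track_popularity` (that name does not appear in the code above), and it uses a small made-up data frame, `toy_songs`, in place of `clean_data`.

```r
library(dplyr)

# Made-up stand-in for clean_data; `track_popularity` is an assumed column name
toy_songs <- data.frame(
  track_name       = paste("song", 1:8),
  track_popularity = c(5, 12, 30, 41, 55, 63, 78, 90),
  energy           = c(0.41, 0.52, 0.60, 0.58, 0.66, 0.71, 0.80, 0.77)
)

# Quartile cut-offs of the popularity score (named "25%" and "75%")
cutoffs <- quantile(toy_songs$track_popularity, probs = c(0.25, 0.75))

# Label each song, then keep only the two extreme groups
popularity_groups <- toy_songs %>%
  mutate(group = case_when(
    track_popularity >= cutoffs[["75%"]] ~ "top 25%",
    track_popularity <= cutoffs[["25%"]] ~ "bottom 25%",
    TRUE ~ "middle"
  )) %>%
  filter(group != "middle")

# Compare one metric across the two groups
group_summary <- popularity_groups %>%
  group_by(group) %>%
  summarise(n_songs = n(), average_energy = mean(energy))

print(group_summary)
```

Keeping both groups in one data frame with a `group` column, rather than building two separate data frames, should make the later comparisons and plots easier, since `group_by()` and ggplot2 can both work directly from that column.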
I could use boxplots to visualize the statistical comparisons, as well as histograms to show the count of songs at each popularity level. I could also use a scatter plot with a facet wrap to differentiate between the genres.
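The three plot types mentioned here might look roughly like the sketch below. It runs on randomly generated data rather than the Spotify data, and the `track_popularity` column name is an assumption; the pairing of energy with popularity in the scatter plot is only an illustration, not a chosen result.

```r
library(ggplot2)

# Randomly generated stand-in data; `track_popularity` is an assumed column name
set.seed(1)
toy_songs <- data.frame(
  playlist_genre   = rep(c("pop", "rock", "rap"), each = 20),
  track_popularity = round(runif(60, 0, 100)),
  energy           = runif(60, 0.2, 1)
)

# Boxplot: spread of popularity within each genre
popularity_boxplot <- ggplot(toy_songs, aes(x = playlist_genre, y = track_popularity)) +
  geom_boxplot() +
  labs(x = "Genre", y = "Popularity")

# Histogram: number of songs in each 10-point popularity band
popularity_histogram <- ggplot(toy_songs, aes(x = track_popularity)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Popularity", y = "Number of songs")

# Scatter plot with one panel per genre
energy_scatter <- ggplot(toy_songs, aes(x = energy, y = track_popularity)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ playlist_genre) +
  labs(x = "Energy", y = "Popularity")
```

Each object is drawn when it is printed, for example with `print(energy_scatter)` or by typing its name at the console.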
I do not yet know how to consolidate my code to make it more efficient, because a lot of it is repeated, and I hope to find a way to write a function or learn new techniques that could streamline it. I also do not know how to work with advanced visuals, and I hope to learn the more detailed aspects of this process to create better-looking graphs.
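One possible way to cut down the repeated summary code is a small helper function that computes the same seven statistics for any list of columns. The function name `summarise_traits` and the `toy_songs` data frame below are made up for illustration; this is a minimal sketch of the idea, not the only way to do it.

```r
library(dplyr)

# Hypothetical helper: returns one row of summary statistics per requested column
summarise_traits <- function(data, columns) {
  rows <- lapply(columns, function(column) {
    values <- data[[column]]
    data.frame(
      trait          = column,
      average        = mean(values),
      max            = max(values),
      min            = min(values),
      first_quartile = unname(quantile(values, 0.25)),
      median         = median(values),
      third_quartile = unname(quantile(values, 0.75)),
      sd             = sd(values)
    )
  })
  # Stack the per-column rows into a single data frame
  bind_rows(rows)
}

# Made-up stand-in for clean_data
toy_songs <- data.frame(
  danceability = c(0.2, 0.4, 0.6, 0.8),
  energy       = c(0.5, 0.5, 0.7, 0.9)
)

trait_summary <- summarise_traits(toy_songs, c("danceability", "energy"))
print(trait_summary)
```

With the real data, a single call such as `summarise_traits(clean_data, c("danceability", "loudness", "energy"))` would stand in for the separate `summarise()` blocks, and the result is one table that is easier to compare across traits.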