Synopsis

I will examine what makes some songs on Spotify popular and why others are not. I want to compare qualitative and quantitative data about the songs, such as energy, speechiness, and genre, and see whether certain ranges within this data could help explain why a song is popular.

To conduct my analysis, I will compare the most popular songs, according to their popularity rating, with the least popular ones. This comparison will measure some of the qualitative and quantitative traits of these two groups and check whether there is a notable difference between them.

Through this analysis, I hope to identify a pattern of characteristics shared by popular songs that Spotify could use to improve song recommendations for users. More popular songs generate more playbacks, and adding more songs that fit the characteristics of a “popular song” can bring in more users. This does not mean we should exclude all the unpopular songs in favor of the popular ones, but the analysis could give us some insight into what kinds of songs users like, and that insight could be used to bring in more users.

Packages Required

library(ggplot2) # for visualizing the data
library(dplyr) # for data manipulation
library(tibble) # to create tibble data
library(tidyr) # to create variable data from music genre
library(magrittr) # for simplifying code with pipes
library(DT) # for interactive data tables

Data Preparation

Before we can understand what makes a song on Spotify popular, we must clean our data set.

Loading the Data

raw_data <- read.csv("spotify_songs.csv")

Describing initial data set

str(raw_data)

The data set has 23 variables, and missing values are stored as NA.

Data Cleaning

First, I will count the missing values and remove the affected rows to create a data set with no missing values. Then I will remove duplicated songs. I will also remove remixes because they could skew the results: remixes tend to be either a hit or a miss, and an unpopular remix could pull down an artist's average popularity.

colSums(is.na(raw_data))
# counts the NA values in each column

complete_data <- na.omit(raw_data)
# creates new data frame with non NA observations

colSums(is.na(complete_data))
# counts the NA values in the new data frame

unique_data <- distinct(complete_data, track_name, .keep_all = TRUE)
# removes observations with the same song title

data_without_remix <- unique_data[!grepl("Remix", unique_data$track_name), ]
# removes the remix songs

clean_data <- select(data_without_remix, -c(track_album_id, playlist_id, track_id))
# removes unique identifier columns for each track and playlist
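One detail of the remix filter worth checking: `grepl("Remix", ...)` is case-sensitive, so titles written as "remix" or "REMIX" would pass through. Whether such titles occur in this data set has not been checked here. The sketch below uses made-up track names (not taken from the data) to show the difference, with `ignore.case = TRUE` as one possible alternative.

```r
# Made-up track names, used only to illustrate the matching behaviour.
toy_names <- c("Song A", "Song A - Remix", "Song B (remix)", "Song C - REMIX")

# Case-sensitive match, as in the cleaning step above.
case_sensitive <- grepl("Remix", toy_names)

# Case-insensitive alternative (an assumption, not part of the original cleaning).
case_insensitive <- grepl("remix", toy_names, ignore.case = TRUE)

# Side-by-side view of which titles each pattern flags as a remix.
data.frame(track_name = toy_names, case_sensitive, case_insensitive)
```

Switching to the case-insensitive form could change the number of rows removed, so any reported row counts would need to be recomputed.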

Data Cleaning Results

  • The original raw data set had 32833 observations with 23 variables
  • The clean data set has 21777 observations with 20 variables
  • A reduction of 11056 observations, or about 34%
  • A reduction of 3 variables
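The reduction figures can be reproduced with a few lines of arithmetic. This sketch hard-codes the counts reported above; in the actual session, `nrow(raw_data)` and `nrow(clean_data)` could be used instead. The variable names are illustrative.

```r
# Row and column counts as reported in the cleaning results.
raw_rows   <- 32833
clean_rows <- 21777
raw_cols   <- 23
clean_cols <- 20

# Absolute and relative reduction in observations.
rows_removed    <- raw_rows - clean_rows
percent_removed <- round(100 * rows_removed / raw_rows, 1)

# Reduction in variables.
cols_removed <- raw_cols - clean_cols

c(rows_removed = rows_removed,
  percent_removed = percent_removed,
  cols_removed = cols_removed)
```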

Data Table

datatable(head(clean_data, 10), caption = 'First 10 Observations')
# shows the first 10 rows; top_n() on track_name would instead pick the 10 alphabetically last titles

Summary Information for variables of concern

# danceability
clean_data %>% 
  summarise(average= mean(danceability), max = max(danceability), min = min(danceability), 
            first_quartile =quantile(danceability, 0.25), median = median(danceability),
            third_quartile = quantile(danceability, .75), sd = sd(danceability)) 

# loudness
clean_data %>%
  summarise(average= mean(loudness), max = max(loudness), min = min(loudness), 
            first_quartile =quantile(loudness, 0.25), median = median(loudness),
            third_quartile = quantile(loudness, .75),sd = sd(loudness))

# energy
clean_data %>%
  summarise(average= mean(energy), max = max(energy), min = min(energy), 
            first_quartile =quantile(energy, 0.25), median = median(energy),
            third_quartile = quantile(energy, .75),sd = sd(energy))

# speechiness
clean_data %>%
  summarise(average= mean(speechiness), max = max(speechiness), min = min(speechiness), 
            first_quartile =quantile(speechiness, 0.25), median = median(speechiness),
            third_quartile = quantile(speechiness, .75),sd = sd(speechiness))

# acousticness
clean_data %>%
  summarise(average= mean(acousticness), max = max(acousticness), min = min(acousticness), 
            first_quartile =quantile(acousticness, 0.25), median = median(acousticness),
            third_quartile = quantile(acousticness, .75),sd = sd(acousticness))

# instrumentalness
clean_data %>%
  summarise(average= mean(instrumentalness), max = max(instrumentalness), min = min(instrumentalness), 
            first_quartile =quantile(instrumentalness, 0.25), median = median(instrumentalness),
            third_quartile = quantile(instrumentalness, 0.75),sd = sd(instrumentalness))

# valence 
clean_data %>%
  summarise(average= mean(valence), max = max(valence), min = min(valence), 
            first_quartile =quantile(valence, 0.25), median = median(valence),
            third_quartile = quantile(valence, .75),sd = sd(valence))

# tempo
clean_data %>%
  summarise(average= mean(tempo), max = max(tempo), min = min(tempo), 
            first_quartile =quantile(tempo, 0.25), median = median(tempo),
            third_quartile = quantile(tempo, .75),sd = sd(tempo))

# duration
clean_data %>%
  summarise(average= mean(duration_ms), max = max(duration_ms), min = min(duration_ms), 
            first_quartile =quantile(duration_ms, 0.25), median = median(duration_ms),
            third_quartile = quantile(duration_ms, .75),sd = sd(duration_ms))

Summarizing Qualitative Data

# grouping data by genre
by_genre <- group_by(clean_data, playlist_genre) 


# finding averages of a few measurable traits by genre
summarise(by_genre, average_danceability = mean(danceability), average_energy = mean(energy), average_loudness = mean(loudness))
## # A tibble: 6 x 4
##   playlist_genre average_danceability average_energy average_loudness
##   <chr>                         <dbl>          <dbl>            <dbl>
## 1 edm                           0.664          0.809            -5.61
## 2 latin                         0.708          0.706            -6.61
## 3 pop                           0.634          0.697            -6.48
## 4 r&b                           0.664          0.587            -8.04
## 5 rap                           0.717          0.647            -7.17
## 6 rock                          0.519          0.729            -7.61

Proposed Exploratory Data Analysis

One approach is to split the data by popularity. For example, I could filter the data into two data frames containing the top 25% and the bottom 25% of songs by popularity rating, then compare the qualitative and quantitative metrics of these songs to see whether the two groups differ. I also plan to break the data down by genre to see which genres are the most popular, and compare their metrics with those of the least popular genres. I could do the same for artists: isolate the most and least popular artists (by taking the mean popularity of all their songs) and compare their song metrics.
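The top/bottom 25% split could be sketched as follows. This is a minimal illustration on a made-up data frame, not a result from the Spotify data. The column name `track_popularity` is an assumption about how the popularity rating is stored, and the toy values are invented.

```r
library(dplyr)

# Toy stand-in for clean_data; all values are made up for illustration.
toy <- data.frame(
  track_popularity = c(5, 12, 20, 33, 41, 50, 62, 70, 85, 93),
  energy           = c(0.40, 0.55, 0.60, 0.62, 0.70, 0.66, 0.71, 0.80, 0.75, 0.85)
)

# 25th and 75th percentiles of the popularity rating.
cutoffs <- quantile(toy$track_popularity, c(0.25, 0.75), names = FALSE)

# Two data frames: songs at or below the lower cutoff, and at or above the upper one.
bottom <- filter(toy, track_popularity <= cutoffs[1])
top    <- filter(toy, track_popularity >= cutoffs[2])

# Stack the two groups with a label, then compare a trait between them.
comparison <- bind_rows(list("bottom 25%" = bottom, "top 25%" = top),
                        .id = "popularity_group") %>%
  group_by(popularity_group) %>%
  summarise(n = n(), average_energy = mean(energy))

comparison
```

Because the cutoffs use `<=` and `>=`, songs tied at a cutoff value all fall into the group, so on real data each group may hold slightly more than 25% of the songs.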

I could use boxplots to visualize the statistical comparisons and histograms to show the count of songs at each popularity level. I could also use a scatter plot with a facet wrap to differentiate between the genres.
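As a rough idea of what such a plot might look like, the sketch below draws boxplots of one trait for two popularity groups, with one panel per genre via `facet_wrap()`. The data frame and the `popularity_group` column are made up for illustration. Only `playlist_genre` and `energy` correspond to columns used earlier.

```r
library(ggplot2)

# Toy data, invented for illustration: two genres, two popularity groups each.
toy <- data.frame(
  playlist_genre   = rep(c("pop", "rock"), each = 6),
  popularity_group = rep(rep(c("top 25%", "bottom 25%"), each = 3), times = 2),
  energy           = c(0.80, 0.75, 0.85, 0.55, 0.60, 0.50,
                       0.70, 0.78, 0.74, 0.65, 0.58, 0.62)
)

# Boxplot of energy by popularity group, one panel per genre.
p <- ggplot(toy, aes(x = popularity_group, y = energy)) +
  geom_boxplot() +
  facet_wrap(~ playlist_genre) +
  labs(x = "Popularity group", y = "Energy",
       title = "Energy by popularity group and genre (toy data)")

p
```

Swapping `geom_boxplot()` for `geom_histogram()` with a single `x` aesthetic, or for `geom_point()` with two numeric aesthetics, would give the histogram and scatter plot variants mentioned above.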

I do not yet know how to consolidate my code to make it more efficient, since a lot of it is repeated, and I hope to find a way to write a function or learn new techniques that could streamline it. I also do not know how to work with advanced visuals, and I hope to learn the more detailed aspects of this process to create better-looking graphs.
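One possible way to cut the repetition in the summary statistics is a small helper function applied to each column name in turn. The sketch below is only one option, written in base R and run on made-up numbers. The helper name `summarise_trait` is hypothetical.

```r
# Hypothetical helper: returns the seven summary statistics for one numeric column.
summarise_trait <- function(data, trait) {
  x <- data[[trait]]
  data.frame(
    trait          = trait,
    average        = mean(x),
    max            = max(x),
    min            = min(x),
    first_quartile = quantile(x, 0.25, names = FALSE),
    median         = median(x),
    third_quartile = quantile(x, 0.75, names = FALSE),
    sd             = sd(x)
  )
}

# Toy stand-in for clean_data; values are made up for illustration.
toy <- data.frame(
  danceability = c(0.2, 0.4, 0.6, 0.8),
  energy       = c(0.5, 0.7, 0.9, 0.3)
)

# Apply the helper to each trait and stack the one-row results into one table.
traits <- c("danceability", "energy")
summary_table <- do.call(rbind, lapply(traits, summarise_trait, data = toy))

summary_table
```

On the real data, `traits` would list the nine variables summarised earlier (danceability through duration_ms) and `data = clean_data` would replace the toy data frame. dplyr's `across()` inside `summarise()` is another route to the same result.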