Synopsis

I will try to understand what makes some songs on Spotify popular and why others are not. I think it is interesting to compare qualitative and quantitative data about the songs, such as energy, speechiness, and genre, and to see whether certain ranges within this data could help explain why a song is popular.

To conduct my analysis, I will compare the most popular songs, according to their popularity rating, with the least popular ones. This comparison will measure some of the qualitative and quantitative traits of these two groups to see whether there is a notable difference between them.

By conducting this analysis for Spotify, I hope to identify a pattern of characteristics shared by popular songs that could improve song recommendations for users. More popular songs generate more playbacks, and adding more songs that fit the characteristics of a “popular song” could bring in more users. This does not mean we should exclude all the unpopular songs in favor of the popular ones, but the analysis could give us some insight into what kinds of songs users like, and we could use that information to bring in more.

Data Cleaning

First, I will count the missing values and remove them to create a data set with no missing values. Then I will remove duplicated songs, and I will also remove any remixes because those could skew the results. Remixes tend to be hit or miss, and an unpopular remix could pull down an artist's average popularity.

colSums(is.na(raw_data))
# counts the NA values in each column

complete_data <- na.omit(raw_data)
# creates a new data frame containing only observations with no NA values

colSums(is.na(complete_data))
# counts the NA values in the new data frame (all should be 0)

unique_data <- distinct(complete_data, track_name, .keep_all = TRUE)
# removes observations with the same song title, keeping the first occurrence

data_without_remix <- unique_data[!grepl("Remix", unique_data$track_name), ]
# removes songs with "Remix" in the title (the match is case-sensitive)

clean_data <- select(data_without_remix, -c(track_album_id, playlist_id, track_id))
# removes the unique identifier columns for each track, album, and playlist

Data Cleaning Results

The original raw data set had 32,833 observations with 23 variables.
The clean data set has 21,777 observations with 20 variables:
a reduction of 11,056 observations, or about 34%, and
a reduction of 3 variables.
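
The reduction figures above can be checked with a few lines of arithmetic. This is only a sketch using the counts reported in this section; the variable names are illustrative and do not appear elsewhere in the analysis.

```r
# Counts reported above for the raw and clean data sets.
raw_rows   <- 32833
clean_rows <- 21777
raw_cols   <- 23
clean_cols <- 20

# Observations and variables removed by cleaning.
removed_rows <- raw_rows - clean_rows
removed_cols <- raw_cols - clean_cols

# Share of the original observations that were removed, in percent.
percent_removed <- 100 * removed_rows / raw_rows

removed_rows
removed_cols
round(percent_removed)
```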

Data Table

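datatable(top_n(clean_data, 10, track_name), caption = 'Top 10 Observations')
# top_n() ranks rows by track_name, so this shows the 10 observations whose track names sort last alphabetically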

Summary Information for Variables of Concern

# danceability
clean_data %>%
  summarise(average = mean(danceability), max = max(danceability), min = min(danceability),
            first_quartile = quantile(danceability, 0.25), median = median(danceability),
            third_quartile = quantile(danceability, 0.75), sd = sd(danceability))

# loudness
clean_data %>%
  summarise(average = mean(loudness), max = max(loudness), min = min(loudness),
            first_quartile = quantile(loudness, 0.25), median = median(loudness),
            third_quartile = quantile(loudness, 0.75), sd = sd(loudness))

# energy
clean_data %>%
  summarise(average = mean(energy), max = max(energy), min = min(energy),
            first_quartile = quantile(energy, 0.25), median = median(energy),
            third_quartile = quantile(energy, 0.75), sd = sd(energy))

# speechiness
clean_data %>%
  summarise(average = mean(speechiness), max = max(speechiness), min = min(speechiness),
            first_quartile = quantile(speechiness, 0.25), median = median(speechiness),
            third_quartile = quantile(speechiness, 0.75), sd = sd(speechiness))

# acousticness
clean_data %>%
  summarise(average = mean(acousticness), max = max(acousticness), min = min(acousticness),
            first_quartile = quantile(acousticness, 0.25), median = median(acousticness),
            third_quartile = quantile(acousticness, 0.75), sd = sd(acousticness))

# instrumentalness
clean_data %>%
  summarise(average = mean(instrumentalness), max = max(instrumentalness), min = min(instrumentalness),
            first_quartile = quantile(instrumentalness, 0.25), median = median(instrumentalness),
            third_quartile = quantile(instrumentalness, 0.75), sd = sd(instrumentalness))

# valence
clean_data %>%
  summarise(average = mean(valence), max = max(valence), min = min(valence),
            first_quartile = quantile(valence, 0.25), median = median(valence),
            third_quartile = quantile(valence, 0.75), sd = sd(valence))

# tempo
clean_data %>%
  summarise(average = mean(tempo), max = max(tempo), min = min(tempo),
            first_quartile = quantile(tempo, 0.25), median = median(tempo),
            third_quartile = quantile(tempo, 0.75), sd = sd(tempo))

# duration
clean_data %>%
  summarise(average = mean(duration_ms), max = max(duration_ms), min = min(duration_ms),
            first_quartile = quantile(duration_ms, 0.25), median = median(duration_ms),
            third_quartile = quantile(duration_ms, 0.75), sd = sd(duration_ms))

Summarizing Qualitative Data

# grouping data by genre
by_genre <- group_by(clean_data, playlist_genre) 


# finding averages of a few measurable traits by genre
summarise(by_genre, average_danceability = mean(danceability), average_energy = mean(energy), average_loudness = mean(loudness))

## # A tibble: 6 x 4
##   playlist_genre average_danceability average_energy average_loudness
##   <chr>                         <dbl>          <dbl>            <dbl>
## 1 edm                           0.664          0.809            -5.61
## 2 latin                         0.708          0.706            -6.61
## 3 pop                           0.634          0.697            -6.48
## 4 r&b                           0.664          0.587            -8.04
## 5 rap                           0.717          0.647            -7.17
## 6 rock                          0.519          0.729            -7.61

Proposed Exploratory Data Analysis

I could break up the data based on popularity. For example, I could filter the data into two data frames that contain the top 25% and the bottom 25% of songs by popularity rating. I could then compare the qualitative and quantitative metrics associated with these songs to see whether the two groups differ. I also plan to split the data by genre to see which genres are the most popular, and to compare their qualitative and quantitative metrics with those of the least popular genres. I could do the same for artists: isolate the most popular artists (by taking the mean of all their songs' popularity) and the least popular artists, and compare their song metrics.
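
The top and bottom 25% split could look roughly like the sketch below. It runs on a small made-up data frame rather than clean_data, and the column name track_popularity is an assumption, since the popularity column is not named in this report.

```r
library(dplyr)

# Toy stand-in for clean_data; `track_popularity` is an assumed column name.
toy_songs <- data.frame(
  track_name       = paste("song", 1:8),
  track_popularity = c(5, 12, 30, 41, 55, 63, 78, 90),
  energy           = c(0.40, 0.55, 0.62, 0.70, 0.66, 0.81, 0.74, 0.88)
)

# Popularity cut-offs at the 25th and 75th percentiles.
cutoffs <- quantile(toy_songs$track_popularity, c(0.25, 0.75))

# One data frame per group.
bottom_quarter <- filter(toy_songs, track_popularity <= cutoffs[[1]])
top_quarter    <- filter(toy_songs, track_popularity >= cutoffs[[2]])

# Compare one trait across the two groups.
comparison <- bind_rows(
  mutate(bottom_quarter, group = "bottom 25%"),
  mutate(top_quarter, group = "top 25%")
) %>%
  group_by(group) %>%
  summarise(n_songs = n(), average_energy = mean(energy))

comparison
```

The same pattern would extend to other traits by adding more columns inside summarise().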

I could use boxplots to visualize the statistical comparisons, as well as histograms to show the count of songs by popularity. I could also use a scatter plot with a facet wrap to differentiate between the genres.
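
As a rough sketch of these three plot types, the code below uses ggplot2 on a small made-up data frame. The package choice and the column name track_popularity are assumptions; playlist_genre and energy are the column names used earlier in this report.

```r
library(ggplot2)

# Toy stand-in for clean_data; `track_popularity` is an assumed column name.
toy_songs <- data.frame(
  playlist_genre   = rep(c("pop", "rock", "rap"), each = 4),
  track_popularity = c(80, 65, 72, 40, 35, 50, 22, 60, 70, 55, 48, 90),
  energy           = c(0.70, 0.62, 0.81, 0.55, 0.77, 0.69, 0.58, 0.84, 0.61, 0.66, 0.52, 0.73)
)

# Boxplot: distribution of popularity within each genre.
popularity_boxplot <- ggplot(toy_songs, aes(x = playlist_genre, y = track_popularity)) +
  geom_boxplot()

# Histogram: count of songs in each 10-point popularity bin.
popularity_histogram <- ggplot(toy_songs, aes(x = track_popularity)) +
  geom_histogram(binwidth = 10)

# Scatter plot of energy against popularity, one panel per genre.
energy_scatter <- ggplot(toy_songs, aes(x = energy, y = track_popularity)) +
  geom_point() +
  facet_wrap(~ playlist_genre)
```

Printing any of the three objects, for example popularity_boxplot, draws the plot.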

I do not yet know how to consolidate my code to make it more efficient, since a lot of it is repeated, and I hope to find a way to write a function or learn new techniques that could streamline it. I also do not know how to work with advanced visuals, and I hope to learn the more detailed aspects of this process to create better-looking graphs.
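
One possible way to cut the repetition in the per-variable summaries is a small helper function. The sketch below is not part of the current analysis: summarise_trait is a hypothetical name, the data frame is made up, and it relies on dplyr's embracing operator, which passes a column name through to summarise().

```r
library(dplyr)

# Hypothetical helper: the same seven statistics for whichever column is passed in.
summarise_trait <- function(data, trait) {
  data %>%
    summarise(average = mean({{ trait }}), max = max({{ trait }}), min = min({{ trait }}),
              first_quartile = quantile({{ trait }}, 0.25), median = median({{ trait }}),
              third_quartile = quantile({{ trait }}, 0.75), sd = sd({{ trait }}))
}

# Toy stand-in for clean_data.
toy_songs <- data.frame(
  danceability = c(0.2, 0.4, 0.6, 0.8),
  energy       = c(0.5, 0.6, 0.7, 0.9)
)

danceability_summary <- summarise_trait(toy_songs, danceability)
energy_summary       <- summarise_trait(toy_songs, energy)

danceability_summary
```

Each of the repeated summarise() blocks would then shrink to a single call such as summarise_trait(clean_data, tempo).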

Spotify Song Popularity

Miguel

11/19/2020
