1 Introduction

1.1 Problem Statement

The digital music industry has witnessed a significant transformation in the last decade, largely driven by streaming services like Spotify. Understanding the dynamics of music preferences, trends, and characteristics is essential for artists, record labels, and marketers in this digital age. This analysis focuses on unraveling the patterns and attributes of popular music tracks on Spotify. The goal is to explore what makes a song popular: is it the beat, the melody, the artist, or the genre? This question is intriguing because it looks at the analytics behind music preferences, offering insights that go beyond subjective measures of popularity.

1.2 Methodology

To address this problem, we will utilize a comprehensive dataset from Spotify, containing over 32,000 tracks with various attributes like danceability, energy, loudness, and popularity scores. Our methodology encompasses several key steps:

Data Cleaning and Preparation: We’ll start by cleaning the dataset, handling missing values, and ensuring data quality.
Exploratory Data Analysis: We will perform statistical analysis and visualization to understand the distribution and relationships among various attributes, focusing on how they correlate with track popularity.
Subgroup Analysis: We will analyze specific subgroups, such as genres or artists, to identify any unique trends.
Summarization of Findings: Key insights will be summarized to highlight the characteristics of popular tracks on Spotify.

1.3 Consumer Benefit

This analysis will benefit music producers, artists, and marketers by providing data-driven insights into what makes a track successful on Spotify. Understanding these trends can help in making informed decisions about music production, marketing strategies, and artist development. For instance, if certain musical attributes consistently correlate with high popularity scores, artists and producers can focus on these elements in their creative process. Marketers and playlist curators can use these insights to target their audience more effectively, enhancing engagement and reach.

2 Packages Required

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse) # For data manipulation and visualization
library(readr)     # For reading CSV files
library(lubridate) # For handling date variables
library(ggplot2)   # For advanced graphing capabilities
library(dplyr)     # For data manipulation
library(tidyr)     # For data tidying

2.1 Reasons for Using Each Package

tidyverse : For data manipulation and visualization
readr : For reading CSV files
lubridate : For handling date variables
ggplot2 : For advanced graphing capabilities
dplyr : For data manipulation
tidyr : For data tidying

3 Data Preparation

3.1 Initial Data Description

setwd('~/Desktop/B BUS 301 A/Final ')
spotify_data <- read_csv("spotify_songs.csv")

## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(spotify_data) # inspect the structure and column types

## spc_tbl_ [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(spotify_data) # summarize each column

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

sapply(spotify_data, function(x) sum(is.na(x)))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

The Spotify dataset consists of 32,833 entries (tracks) and 23 variables. These variables encompass a range of information, including track identifiers, names, artists, popularity scores, album details, playlist information, and several musical attributes like danceability, energy, and tempo.

A notable peculiarity is the presence of missing values in some textual fields like track_name and track_artist. Additionally, the track_album_release_date is recorded as a string, which could be more effectively utilized if converted into a date format.

3.2 Data Importing and Cleaning

# Check the total number of missing values
sum(is.na(spotify_data))

## [1] 15

# Count missing values in each column
missing_values <- sapply(spotify_data, function(x) sum(is.na(x)))
missing_values

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

# Replace missing text values with "Unknown" instead of dropping the rows
spotify_data <- spotify_data %>%
  mutate(track_name = ifelse(is.na(track_name), "Unknown", track_name),
         track_artist = ifelse(is.na(track_artist), "Unknown", track_artist),
         track_album_name = ifelse(is.na(track_album_name), "Unknown", track_album_name))

#Convert track_album_release_date from a string to a Date object for better handling of date-related operations
spotify_data$track_album_release_date <- as.Date(spotify_data$track_album_release_date)

Logical Process in Data Cleaning

Identify Missing Data: The first step is to assess the extent of missing data in the dataset. This helps in understanding the scope of the issue and its potential impact on the analysis.
Detailed Missing Data Analysis: By counting missing values in each column, we can pinpoint exactly where the gaps are. Here, only 15 values are missing, five in each of track_name, track_artist, and track_album_name.
Handling Missing Values: Instead of removing entire rows, which would discard tracks whose numerical attributes are complete, we fill the missing textual information with a placeholder (“Unknown”). This keeps all 32,833 tracks available for analysis.
Data Type Conversion: Converting track_album_release_date to a Date type makes date-related analyses (like trend analysis over time) feasible.
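
One thing worth checking after the date conversion is whether every string actually parsed. The sketch below uses made-up values, and it assumes (this is not confirmed by the output above) that some release dates in the raw file are recorded only as a year or a year and month. In that case as.Date() silently returns NA for them, which would later produce NA decades. The helper pad_partial_date() is a hypothetical name introduced only for illustration.

```r
# Hypothetical release-date strings: one full date and two partial ones.
raw_dates <- c("2019-06-14", "2012", "2005-03")

# as.Date() picks its format from the first value, so strings that do
# not match "%Y-%m-%d" come back as NA rather than raising an error.
default_parse <- as.Date(raw_dates)
sum(is.na(default_parse))

# Illustrative helper (not part of the original analysis): pad partial
# dates to the first day of the year or month before parsing.
pad_partial_date <- function(x) {
  x <- ifelse(nchar(x) == 4, paste0(x, "-01-01"), x)  # "YYYY"    -> "YYYY-01-01"
  x <- ifelse(nchar(x) == 7, paste0(x, "-01"), x)     # "YYYY-MM" -> "YYYY-MM-01"
  x
}

padded_parse <- as.Date(pad_partial_date(raw_dates), format = "%Y-%m-%d")
padded_parse
```

Running sum(is.na(spotify_data$track_album_release_date)) right after the conversion would show whether this applies to the real data. Padding to the first day of the period is a simplification, but it is harmless when only the year or decade is used afterwards.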

3.3 Cleaned Dataset Preview

head(spotify_data, 10)

## # A tibble: 10 × 23
##    track_id              track_name track_artist track_popularity track_album_id
##    <chr>                 <chr>      <chr>                   <dbl> <chr>         
##  1 6f807x0ima9a1j3VPbc7… I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
##  2 0r7CVbZTWZgbTCYdfa2P… Memories … Maroon 5                   67 63rPSO264uRjW…
##  3 1z1Hg7Vb0AhHDiEmnDE7… All the T… Zara Larsson               70 1HoSmj2eLcsrR…
##  4 75FpbthrwQmzHlBJLuGd… Call You … The Chainsm…               60 1nqYsOef1yKKu…
##  5 1e8PAfcKUYoKkxPhrHqw… Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
##  6 7fvUMiyapMsRRxr07cU8… Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
##  7 2OAylPUDDfwRGfe0lYql… Never Rea… Katy Perry                 62 7INHYSeusaFly…
##  8 6b1RNvAcJjQH73eZO4BL… Post Malo… Sam Feldt                  69 6703SRPsLkS4b…
##  9 7bF6tCO3gFb8INrEDcjN… Tough Lov… Avicii                     68 7CvAfGvq4RlIw…
## 10 1IXGILkPm0tOCNeq00kC… If I Can'… Shawn Mendes               67 4QxzbfSsVryEQ…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <date>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>

3.4 Summary of Key Variables in the Cleaned Dataset

# Summary statistics for key numerical variables
summary_select <- spotify_data %>% select(track_popularity, danceability, energy, tempo, duration_ms)
summary(summary_select)

##  track_popularity  danceability        energy             tempo       
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   :  0.00  
##  1st Qu.: 24.00   1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 99.96  
##  Median : 45.00   Median :0.6720   Median :0.721000   Median :121.98  
##  Mean   : 42.48   Mean   :0.6548   Mean   :0.698619   Mean   :120.88  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.:133.92  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :239.44  
##   duration_ms    
##  Min.   :  4000  
##  1st Qu.:187819  
##  Median :216000  
##  Mean   :225800  
##  3rd Qu.:253585  
##  Max.   :517810

# Frequency of categorical variables like genre
table(spotify_data$playlist_genre)

## 
##   edm latin   pop   r&b   rap  rock 
##  6043  5155  5507  5431  5746  4951

The cleaned dataset contains variables that capture both the musical attributes of tracks and their popularity on Spotify. The numerical variables track_popularity, danceability, energy, tempo, and duration_ms are summarized above to give an overview of their distributions. For instance, track_popularity ranges from 0 to 100 with a median of 45, covering everything from lesser-known tracks to hits. Similarly, danceability ranges from 0 to 0.983 and energy from 0.000175 to 1, reflecting the diverse nature of the music in the dataset.

Additionally, the categorical variable playlist_genre is distributed fairly evenly across six genres: edm (6,043 tracks), rap (5,746), pop (5,507), r&b (5,431), latin (5,155), and rock (4,951). This indicates the dataset’s diversity in terms of music styles.

These summaries provide a foundational understanding of the dataset, highlighting the key aspects of the tracks that will be further explored in our analysis. The range and variety in these variables suggest a rich dataset, suitable for in-depth analysis of music trends on Spotify.
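
To make the genre balance easier to read, the counts printed by table() above can be converted into percentage shares. The sketch below types the six counts in by hand so that it runs on its own. With the real data, prop.table(table(spotify_data$playlist_genre)) should give the same proportions.

```r
# Genre counts copied from the table() output above.
genre_counts <- c(edm = 6043, latin = 5155, pop = 5507,
                  "r&b" = 5431, rap = 5746, rock = 4951)

# The counts should add up to the number of rows in the dataset.
total_tracks <- sum(genre_counts)

# Share of each genre as a percentage, rounded to one decimal place.
genre_share <- round(100 * genre_counts / total_tracks, 1)
sort(genre_share, decreasing = TRUE)
```

No genre is far from an even one-sixth split, so differences between genres in later plots are unlikely to be driven by sample size alone.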

4 Exploratory Data Analysis Plan

4.1 Approaches to Analyzing the Data

# Create a new variable for the decade
spotify_data <- spotify_data %>%
  mutate(decade = floor(year(track_album_release_date) / 10) * 10)

# Grouping by genre and decade, then summarizing key metrics
genre_decade_summary <- spotify_data %>%
  group_by(playlist_genre, decade) %>%
  summarise(avg_popularity = mean(track_popularity, na.rm = TRUE),
            avg_danceability = mean(danceability, na.rm = TRUE))

## `summarise()` has grouped output by 'playlist_genre'. You can override using
## the `.groups` argument.

# Identify the most recent decade
most_recent_decade <- max(spotify_data$decade, na.rm = TRUE)

# Filter the top 50 tracks in that decade
top_tracks_recent_decade <- spotify_data %>%
  filter(decade == most_recent_decade) %>%
  arrange(desc(track_popularity)) %>%
  head(50)

# Filter for Latin genre in the most recent decade
latin_tracks_recent_decade <- spotify_data %>%
  filter(decade == most_recent_decade, playlist_genre == "latin")


# Filter for Latin tracks
latin_tracks <- spotify_data %>%
  filter(playlist_genre == "latin")

# Filter for Latin genre and select top 5 tracks by popularity
top_5_latin_tracks <- spotify_data %>%
  filter(playlist_genre == "latin") %>%
  arrange(desc(track_popularity)) %>%
  head(5)

# Calculate the average danceability across all tracks (reference line for the top 5 Latin tracks plot)
average_danceability <- mean(spotify_data$danceability, na.rm = TRUE)
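
One caveat when ranking tracks with arrange() and head(): in playlist-based data the same track_id can appear once per playlist, so a "top 5" may contain the same song several times. Whether this happens in spotify_data is an assumption here, not something the output above confirms. The sketch below uses a small made-up data frame and base R to show the effect and one way to guard against it.

```r
# Hypothetical tracks: "a1" and "c3" each appear in more than one playlist.
tracks <- data.frame(
  track_id         = c("a1", "a1", "b2", "c3", "c3", "d4"),
  playlist_name    = c("Mix 1", "Mix 2", "Mix 1", "Mix 2", "Mix 3", "Mix 1"),
  track_popularity = c(98, 98, 91, 95, 95, 80),
  stringsAsFactors = FALSE
)

# Naive ranking: duplicated tracks take up several of the top slots.
naive_top3 <- head(tracks[order(-tracks$track_popularity), ], 3)

# Keep the first row per track_id, then rank.
unique_tracks <- tracks[!duplicated(tracks$track_id), ]
dedup_top3 <- head(unique_tracks[order(-unique_tracks$track_popularity), ], 3)

naive_top3$track_id
dedup_top3$track_id
```

In a dplyr pipeline, the same de-duplication can be done with distinct(track_id, .keep_all = TRUE) before arrange(). Comparing n_distinct(spotify_data$track_id) with nrow(spotify_data) would show whether it is needed.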

4.2 Visualization and Presentation

# Plot Average Popularity by Genre Over Decades
ggplot(genre_decade_summary, aes(x = decade, y = avg_popularity, group = playlist_genre, color = playlist_genre)) +
  geom_line() +
  labs(title = "Average Popularity by Genre Over Decades", x = "Decade", y = "Average Popularity")

## Warning: Removed 6 rows containing missing values (`geom_line()`).

# Plot genres of Top 50 Tracks in the Most Recent Decade
ggplot(top_tracks_recent_decade, aes(x = reorder(playlist_genre, playlist_genre, function(x)-length(x)))) +
  geom_bar() +
  labs(title = "Genres of Top 50 Tracks in the Most Recent Decade",
       x = "Genre",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Scatter plot of Track Popularity vs Danceability for Latin tracks
ggplot(latin_tracks, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.6, color = "darkgreen") +  
  labs(title = "Track Popularity vs Danceability for Latin Tracks",
       x = "Danceability",
       y = "Track Popularity") +
  theme_minimal() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red")

# Plot the danceability of the top 5 Latin tracks with the average line
ggplot(top_5_latin_tracks, aes(x = reorder(track_album_name, -danceability), y = danceability)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +
  labs(title = "Danceability of Top 5 Latin Tracks by Album",
       x = "Album Name",
       y = "Danceability") +
  theme_minimal() +
  geom_hline(yintercept = average_danceability, linetype = "dashed", color = "red", linewidth = 1.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4.3 Skills and Knowledge Development

Data Visualization: While I am able to create basic plots with ggplot2, there are advanced visualization techniques that could better represent the data.
Time Series Analysis: Understanding the changes in music trends over time is crucial. I need to enhance my skills in time series analysis to make more accurate forecasts and understand seasonal patterns within music popularity.

5 Summary

5.1 Problem Statement Summary

The problem statement at the core of this analysis was to discern the factors that contribute to the popularity of tracks on Spotify. The digital transformation of the music industry, propelled by streaming platforms, necessitates a deeper understanding of music consumption patterns. This analysis aimed to uncover the attributes of tracks that resonate with listeners and achieve higher popularity ratings.

5.2 Methodological Approach Summary

To explore this problem, a dataset comprising over 32,000 Spotify tracks was employed, featuring diverse musical attributes. The methodology included:

Data Cleaning and Preparation: Addressing missing values, standardizing data formats, and ensuring overall data quality.
Exploratory Data Analysis: Statistical analysis and visualization techniques were applied to understand the distribution and correlation of musical attributes with track popularity.
Subgroup Analysis: Focused analysis on specific genres, especially Latin, to determine distinctive trends.

5.3 Insights from the Analysis

Genre Trends: Popularity trends over decades revealed shifts in genre dominance, with the rise and fall of genres like rock and EDM.
Genre Dominance: The bar chart of the top 50 tracks in the most recent decade highlighted Latin as the prevailing genre, suggesting a shift in listener preferences.
Danceability vs Popularity: The scatter plot indicated a mild positive correlation between danceability and track popularity for Latin tracks, implying that more danceable tracks tend to be more popular.
Album Attributes: The danceability of top Latin tracks by album demonstrated that certain albums have a consistent theme of high-danceability tracks.
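
The "mild positive correlation" read from the scatter plot could be backed by a number. The sketch below shows how Pearson's r and the slope of the fitted line might be computed. It uses a handful of made-up values rather than the Spotify data, so the results only illustrate the mechanics.

```r
# Hypothetical danceability and popularity values for five tracks.
danceability     <- c(0.5, 0.6, 0.7, 0.8, 0.9)
track_popularity <- c(40, 55, 50, 65, 70)

# Pearson correlation coefficient between the two variables.
r <- cor(danceability, track_popularity)

# Slope of the least-squares line, the same kind of fit that
# geom_smooth(method = "lm") draws on the scatter plot.
fit   <- lm(track_popularity ~ danceability)
slope <- unname(coef(fit)["danceability"])

r
slope
```

On the real data, cor(latin_tracks$danceability, latin_tracks$track_popularity) would give the corresponding value, and cor.test() adds a confidence interval. A small r with a large sample can still be statistically significant while explaining little of the variation in popularity.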

5.4 Consumer Implications

For stakeholders in the music industry, such as artists, producers, and marketers, the analysis suggests that focusing on danceability and aligning with prevailing genre trends could be beneficial strategies. Additionally, understanding decade-specific genre popularity may guide marketing campaigns and production decisions.

5.5 Limitations of the Analysis

Dataset Scope: The dataset may not capture all nuances of global music trends, being limited to Spotify’s platform.
Causal Inference: The study’s correlational nature does not imply causation. Further research is needed to establish causative factors for track popularity.
Temporal Dynamics: Rapid changes in music trends may not be fully captured in the dataset due to its static nature.
Subjectivity of Popularity: Popularity is a complex metric influenced by many factors beyond the scope of this dataset, including social media influence, marketing efforts, and cultural movements.

Final Project Evaluation

Eric Jiahong Li

2023-12-08