We analyze the Spotify Songs dataset to understand what makes a song popular and how attributes such as danceability and energy contribute to a song’s success. The analysis is intended for music enthusiasts, artists, and the music industry.
The problem we address is identifying the key factors that contribute to the popularity of songs on Spotify.
To address this problem, we will conduct exploratory data analysis (EDA) and visualize the relationships between song attributes and popularity. We will use various plots and statistical analysis to draw meaningful conclusions.
Stakeholders include music artists, record labels, and music streaming platforms looking to improve song recommendations and understand user preferences.
To answer questions about song popularity and its correlation with various attributes in the Spotify Songs dataset, we have several avenues to explore:
Correlations: We’ll begin by calculating correlations between song attributes, such as danceability, energy, valence, and song popularity. This will help us identify which attributes are strongly associated with a song’s popularity.
Grouping and Aggregation: To gain a deeper understanding, we’ll group songs by various factors, including genre, artist, and release date. This approach will allow us to discern patterns in popularity within these specific groups.
Time Series Analysis: Tracking trends in song popularity over time is crucial. We’ll analyze the data by aggregating it based on release dates, helping us identify whether newer songs tend to be more popular.
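The correlation and release-date steps described above could be sketched as follows. This is a minimal illustration, not the analysis itself: `toy_songs` is a hypothetical data frame with made-up values that borrows the dataset's column names, and we assume every release date string starts with a four-digit year.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration only);
# column names match those in the Spotify Songs dataset.
toy_songs <- data.frame(
  track_popularity = c(10, 35, 50, 62, 80, 45),
  danceability = c(0.40, 0.55, 0.60, 0.72, 0.85, 0.65),
  energy = c(0.90, 0.70, 0.65, 0.80, 0.60, 0.75),
  valence = c(0.20, 0.45, 0.50, 0.70, 0.90, 0.40),
  track_album_release_date = c("2001-05-14", "2010", "2015-03-01",
                               "2018-11-02", "2019-06-21", "2019-01-11")
)

# Pearson correlation matrix of popularity and the audio attributes
numeric_cols <- c("track_popularity", "danceability", "energy", "valence")
cor_matrix <- cor(toy_songs[, numeric_cols], use = "complete.obs")

# Keep only each attribute's correlation with popularity, strongest first
popularity_cors <- sort(cor_matrix["track_popularity", -1], decreasing = TRUE)
print(popularity_cors)

# Average popularity per release year.
# Assumption: the first four characters of the date string are the year.
popularity_by_year <- toy_songs %>%
  mutate(release_year = as.integer(substr(track_album_release_date, 1, 4))) %>%
  group_by(release_year) %>%
  summarize(avg_popularity = mean(track_popularity), n_songs = n()) %>%
  arrange(release_year)
print(popularity_by_year)
```

On the real data, the same calls would take `spotify_data` (or its cleaned version) in place of `toy_songs`. Taking the year from the start of the string is a deliberately simple choice that avoids parsing full dates.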
To visually represent our findings and make the data more accessible, we’ll use several types of plots and tables:
Scatter Plots: We’ll create scatter plots to visualize the relationship between two numeric variables, such as danceability vs. popularity or energy vs. popularity.
Tables: To provide a clear summary of our findings, we’ll generate tables with key statistics, such as means, medians, and standard deviations for various attributes. These tables will let us compare statistics across different genres or artists.
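As a rough sketch of the plots and tables described above, the following builds a per-genre summary table and a danceability-versus-popularity scatter plot. The `toy_songs` data frame and its values are hypothetical and only stand in for the real dataset; the column names are taken from the dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (values invented for illustration only)
toy_songs <- data.frame(
  playlist_genre = c("pop", "pop", "pop", "rock", "rock", "rock"),
  track_popularity = c(60, 70, 80, 30, 40, 50),
  danceability = c(0.70, 0.75, 0.80, 0.45, 0.50, 0.55)
)

# Summary table: mean, median, and standard deviation of popularity per genre
genre_summary <- toy_songs %>%
  group_by(playlist_genre) %>%
  summarize(
    mean_popularity = mean(track_popularity),
    median_popularity = median(track_popularity),
    sd_popularity = sd(track_popularity),
    n_songs = n()
  )
print(genre_summary)

# Scatter plot of danceability against popularity, colored by genre
popularity_scatter <- ggplot(toy_songs,
                             aes(x = danceability, y = track_popularity,
                                 color = playlist_genre)) +
  geom_point() +
  labs(title = "Danceability vs. Popularity",
       x = "Danceability",
       y = "Song Popularity",
       color = "Playlist Genre")
```

Printing `popularity_scatter` renders the plot. Swapping `danceability` for `energy` gives the second scatter plot mentioned in the list.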
This analysis may require us to learn some new techniques:
Advanced Statistical Analysis: If we choose to integrate advanced statistical models, we may need to explore techniques like linear regression, multiple regression, or even machine learning models to predict popularity accurately.
Advanced Visualization: For more intricate visualizations, we may need to explore advanced data visualization libraries and techniques that allow us to convey our findings with precision.
Our data analysis is iterative: we begin with straightforward visualizations and progressively incorporate more complex techniques as needed.
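If we do move on to regression, a multiple linear regression of popularity on a few audio attributes might look like the sketch below. It uses base R's `lm()`. The `toy_songs` data frame is hypothetical, with made-up values, so the fitted coefficients say nothing about the real dataset.

```r
# Hypothetical toy data (values invented for illustration only);
# column names match those in the Spotify Songs dataset.
toy_songs <- data.frame(
  track_popularity = c(10, 35, 50, 62, 80, 45, 55, 28),
  danceability = c(0.40, 0.55, 0.60, 0.72, 0.85, 0.65, 0.68, 0.50),
  energy = c(0.90, 0.70, 0.65, 0.80, 0.60, 0.75, 0.55, 0.85),
  valence = c(0.20, 0.45, 0.50, 0.70, 0.90, 0.40, 0.60, 0.35)
)

# Multiple linear regression: popularity as a function of three attributes
popularity_model <- lm(track_popularity ~ danceability + energy + valence,
                       data = toy_songs)

# Estimated coefficients and the share of variance explained
model_coefs <- coef(popularity_model)
model_r2 <- summary(popularity_model)$r.squared
print(model_coefs)
print(model_r2)

# Predicted popularity for a new, hypothetical song
new_song <- data.frame(danceability = 0.70, energy = 0.70, valence = 0.50)
predicted_popularity <- predict(popularity_model, newdata = new_song)
print(predicted_popularity)
```

A single-predictor model, such as `track_popularity ~ danceability`, is a reasonable first step before adding further attributes.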
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
spotify_data <- read_csv("spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
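# Count the variables and the total number of missing values
num_variables <- ncol(spotify_data)
missing_data <- sum(is.na(spotify_data))
# Drop every row that contains a missing value
spotify_cleaned_data <- spotify_data %>%
  drop_na()
# Summarize each column of the raw (uncleaned) data
summary_stats <- summary(spotify_data)
summary_stats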
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
ggplot(spotify_cleaned_data, aes(x = reorder(playlist_genre, track_popularity), y = track_popularity)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(title = "Distribution of Song Popularity by Playlist Genre",
       x = "Playlist Genre",
       y = "Song Popularity")
# Grouping and Aggregation: Average Popularity by Playlist Genre
popularity_by_genre <- spotify_cleaned_data %>%
  group_by(playlist_genre) %>%
  summarize(avg_popularity = mean(track_popularity, na.rm = TRUE))
# Sort the genres by average popularity in descending order
popularity_by_genre <- popularity_by_genre %>%
  arrange(desc(avg_popularity))
# Bar chart of the genre averages (reorder() controls the bar order in the plot)
ggplot(popularity_by_genre, aes(x = reorder(playlist_genre, avg_popularity), y = avg_popularity)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Average Popularity of Songs by Playlist Genre",
       x = "Playlist Genre",
       y = "Average Popularity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(spotify_cleaned_data, aes(x = danceability, y = energy)) +
  geom_point() +
  labs(title = "Danceability vs. Energy",
       x = "Danceability",
       y = "Energy")
# Summary
The analysis aimed to identify key factors contributing to the popularity of songs on Spotify.
We employed exploratory data analysis (EDA) to visualize the relationships between song attributes and popularity. Various statistical analyses, including correlations, grouping, and time series analysis, were performed to draw meaningful conclusions.
Insights from the analysis include understanding the correlations between song attributes and popularity, identifying patterns within specific groups (genre, artist), and tracking popularity trends over time.
Implications of this analysis are relevant for music artists, record labels, and music streaming platforms seeking to improve song recommendations and understand user preferences.
Limitations of this analysis include potential biases in the dataset, the absence of certain attributes that could influence popularity, and the dynamic nature of music preferences.
In conclusion, this analysis provided valuable insights into the factors influencing song popularity on Spotify. The correlations, group-level patterns, and trends over time contribute to a comprehensive understanding of the dynamics in the music industry.
As a data analyst, the journey involves continuous learning, iteration, and adaptation of methods based on the data. The presented analysis serves as a foundation for further exploration, potentially incorporating advanced statistical models and visualization techniques in the future.
# References
- Spotify Songs dataset (provided in the course)
- ggplot2, dplyr, tidyr, and readr R package documentation
# Acknowledgements
Special thanks to the course instructor for providing guidance and valuable insights throughout the data analysis journey.
# Session Information
```r
sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readr_2.1.4 tidyr_1.3.0 dplyr_1.1.3 ggplot2_3.4.3
##
## loaded via a namespace (and not attached):
## [1] bit_4.0.5 gtable_0.3.4 jsonlite_1.8.7 crayon_1.5.2
## [5] compiler_4.3.1 tidyselect_1.2.0 parallel_4.3.1 jquerylib_0.1.4
## [9] scales_1.2.1 yaml_2.3.7 fastmap_1.1.1 R6_2.5.1
## [13] labeling_0.4.3 generics_0.1.3 knitr_1.44 tibble_3.2.1
## [17] munsell_0.5.0 bslib_0.5.1 pillar_1.9.0 tzdb_0.4.0
## [21] rlang_1.1.1 utf8_1.2.3 cachem_1.0.8 xfun_0.40
## [25] sass_0.4.7 bit64_4.0.5 cli_3.6.1 withr_2.5.1
## [29] magrittr_2.0.3 digest_0.6.33 grid_4.3.1 vroom_1.6.3
## [33] rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.3
## [37] evaluate_0.22 glue_1.6.2 farver_2.1.1 fansi_1.0.4
## [41] colorspace_2.1-0 rmarkdown_2.25 purrr_1.0.2 tools_4.3.1
## [45] pkgconfig_2.0.3 htmltools_0.5.6