Introduction

We analyze the Spotify Songs dataset to gain insight into song characteristics and popularity. We aim to understand what makes a song popular and how attributes such as danceability and energy contribute to a song’s success. The analysis is intended to be useful to music enthusiasts, artists, and the music industry.

Problem Statement

The main problem is to identify the key factors that contribute to the popularity of songs on Spotify.

Approach

To address this problem, we will conduct exploratory data analysis (EDA) and visualize the relationships between song attributes and popularity. We will use various plots and statistical analysis to draw meaningful conclusions.

Stakeholders

Stakeholders include music artists, record labels, and music streaming platforms looking to improve song recommendations and understand user preferences.

Exploratory Data Analysis (EDA)

When delving into the Spotify Songs dataset to answer our questions about song popularity and its correlation with various attributes, we have several avenues to explore:

Correlations: We’ll begin by calculating correlations between song attributes, such as danceability, energy, valence, and song popularity. This will help us identify which attributes are strongly associated with a song’s popularity.

Grouping and Aggregation: To gain a deeper understanding, we’ll group songs by various factors, including genre, artist, and release date. This approach will allow us to discern patterns in popularity within these specific groups.

Time Series Analysis: Tracking trends in song popularity over time is crucial. We’ll analyze the data by aggregating it based on release dates, helping us identify whether newer songs tend to be more popular.
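As a minimal sketch of the correlation step, the snippet below computes Pearson correlations between `track_popularity` and a few audio features. The helper name `popularity_correlations` and the toy data frame are illustrative assumptions, not part of the original analysis; in the report the helper would be applied to the cleaned Spotify data instead.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the Spotify data; the values are made up for illustration.
toy_songs <- tibble(
  track_popularity = c(10, 35, 50, 62, 80, 95),
  danceability     = c(0.40, 0.55, 0.60, 0.70, 0.75, 0.90),
  energy           = c(0.90, 0.30, 0.70, 0.50, 0.80, 0.60),
  valence          = c(0.20, 0.60, 0.40, 0.80, 0.30, 0.70)
)

# Hypothetical helper: correlate each chosen attribute with popularity
# and return one row per attribute, strongest association first.
popularity_correlations <- function(songs, attributes) {
  songs %>%
    summarize(across(all_of(attributes),
                     ~ cor(.x, track_popularity, use = "complete.obs"))) %>%
    pivot_longer(everything(),
                 names_to = "attribute",
                 values_to = "correlation") %>%
    arrange(desc(abs(correlation)))
}

cor_table <- popularity_correlations(
  toy_songs, c("danceability", "energy", "valence")
)
print(cor_table)
```

Sorting by absolute value keeps strong negative associations near the top of the table alongside strong positive ones. A correlation only describes a linear association, so it does not show that an attribute causes popularity.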

To present our findings visually and make the data more accessible, we’ll use several types of plots and tables:

Scatter Plots: We’ll create scatter plots to visualize the relationship between two numeric variables, such as danceability vs. popularity or energy vs. popularity.

Tables: To provide a clear summary of our findings, we’ll generate tables with key statistics, such as means, medians, and standard deviations for various attributes. These tables will be crucial for comparing statistics across different genres or artists.
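A summary table of this kind could be built along the following lines. This is a sketch only: the helper `genre_summary` is a hypothetical name, and the toy values stand in for the real dataset.

```r
library(dplyr)

# Toy stand-in for the Spotify data; the values are made up for illustration.
toy_songs <- tibble(
  playlist_genre   = c("pop", "pop", "pop", "rock", "rock", "rock"),
  track_popularity = c(60, 70, 80, 30, 40, 50)
)

# Hypothetical helper: mean, median, and standard deviation of
# popularity for each playlist genre.
genre_summary <- function(songs) {
  songs %>%
    group_by(playlist_genre) %>%
    summarize(
      n_songs           = n(),
      mean_popularity   = mean(track_popularity, na.rm = TRUE),
      median_popularity = median(track_popularity, na.rm = TRUE),
      sd_popularity     = sd(track_popularity, na.rm = TRUE),
      .groups = "drop"
    )
}

genre_table <- genre_summary(toy_songs)
print(genre_table)
```

Including the group size `n_songs` next to each statistic makes it easier to judge how much weight a genre-level mean deserves.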

This analysis may require some learning:

Advanced Statistical Analysis: If we choose to integrate advanced statistical models, we may need to explore techniques like linear regression, multiple regression, or even machine learning models to predict popularity accurately.

Advanced Visualization: For more intricate visualizations, we may need to explore advanced data visualization libraries and techniques that allow us to convey our findings with precision.

Data analysis often involves iteration, beginning with straightforward visualizations and progressively incorporating more complex techniques as needed.
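To illustrate the regression idea mentioned above, here is a minimal sketch of a multiple linear regression with base R's `lm()`. The toy data and the choice of `danceability` and `energy` as predictors are assumptions made for the example; no such model is fitted in this report.

```r
# Toy stand-in for the Spotify data; the values are made up for illustration.
toy_songs <- data.frame(
  track_popularity = c(12, 30, 41, 55, 63, 78, 90, 85),
  danceability     = c(0.35, 0.50, 0.55, 0.62, 0.70, 0.78, 0.88, 0.80),
  energy           = c(0.80, 0.45, 0.66, 0.52, 0.71, 0.60, 0.75, 0.58)
)

# Multiple linear regression of popularity on two audio features.
popularity_model <- lm(track_popularity ~ danceability + energy,
                       data = toy_songs)

# Coefficient estimates with standard errors and p-values.
print(summary(popularity_model)$coefficients)

# Predicted popularity for a hypothetical new song.
new_song  <- data.frame(danceability = 0.65, energy = 0.70)
predicted <- predict(popularity_model, newdata = new_song)
print(predicted)
```

On the real data, residual plots and a held-out test set would be needed before trusting such a model's predictions.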

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readr)

Summary Statistics

spotify_data <- read_csv("spotify_songs.csv")

## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

num_variables <- ncol(spotify_data)
missing_data <- sum(is.na(spotify_data))
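Before rows with missing values are dropped, it can help to see where the gaps are. The sketch below counts missing values per column and the number of rows that `drop_na()` would remove; the toy data frame is an assumption standing in for `spotify_data`.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for spotify_data; the values are made up for illustration.
toy_songs <- tibble(
  track_name       = c("Song A", NA, "Song C", "Song D"),
  track_artist     = c("Artist 1", "Artist 2", NA, "Artist 4"),
  track_popularity = c(55, 40, 72, 10)
)

# Count missing values in each column.
missing_by_column <- colSums(is.na(toy_songs))
print(missing_by_column)

# Number of rows that drop_na() would remove.
rows_dropped <- nrow(toy_songs) - nrow(drop_na(toy_songs))
print(rows_dropped)
```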



spotify_cleaned_data <- spotify_data %>%
  drop_na() 

summary_stats <- summary(spotify_data)
summary_stats

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

Exploratory Data Analysis Continued

Boxplot of Song Popularity by Playlist Genre

ggplot(spotify_cleaned_data, aes(x = reorder(playlist_genre, track_popularity), y = track_popularity)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(title = "Distribution of Song Popularity by Playlist Genre",
       x = "Playlist Genre",
       y = "Song Popularity")

# Average Popularity by Playlist Genre

popularity_by_genre <- spotify_cleaned_data %>%
  group_by(playlist_genre) %>%
  summarize(avg_popularity = mean(track_popularity, na.rm = TRUE))

# Sort the genres by average popularity in descending order
popularity_by_genre <- popularity_by_genre %>%
  arrange(desc(avg_popularity))

Bar Chart of Average Popularity by Playlist Genre

ggplot(popularity_by_genre, aes(x = reorder(playlist_genre, avg_popularity), y = avg_popularity)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Average Popularity of Songs by Playlist Genre",
       x = "Playlist Genre",
       y = "Average Popularity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
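
Tracking popularity over time, as described in the EDA plan, could be sketched by aggregating on release year. The snippet assumes that `track_album_release_date` is a character column whose first four characters are the year; the toy values and the names `popularity_by_year` and `trend_plot` are illustrative.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the cleaned Spotify data; the values are made up.
toy_songs <- tibble(
  track_album_release_date = c("2015-03-01", "2015-11-20", "2017-06-09",
                               "2017-08-15", "2019-01-04", "2019"),
  track_popularity         = c(20, 40, 50, 60, 70, 90)
)

# Assumption: the first four characters of the release date are the year
# (for example "2019-06-14" or just "2019").
popularity_by_year <- toy_songs %>%
  mutate(release_year = as.integer(substr(track_album_release_date, 1, 4))) %>%
  group_by(release_year) %>%
  summarize(avg_popularity = mean(track_popularity, na.rm = TRUE),
            n_songs        = n(),
            .groups = "drop")

# Line plot of average popularity by release year.
trend_plot <- ggplot(popularity_by_year,
                     aes(x = release_year, y = avg_popularity)) +
  geom_line(color = "darkblue") +
  geom_point(color = "darkblue") +
  labs(title = "Average Song Popularity by Release Year",
       x = "Release Year",
       y = "Average Popularity")

print(popularity_by_year)
```

Years with very few songs can produce noisy averages, which is why the sketch keeps `n_songs` alongside the mean.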

ggplot(spotify_cleaned_data, aes(x = danceability, y = energy)) +
  geom_point() +
  labs(title = "Danceability vs. Energy",
       x = "Danceability",
       y = "Energy")
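
The plot above compares two attributes with each other. A companion scatter plot with popularity on the y-axis, such as the danceability vs. popularity plot mentioned in the EDA plan, could look like the sketch below. The toy data and the added linear trend line are assumptions for illustration.

```r
library(ggplot2)

# Toy stand-in for the cleaned Spotify data; the values are made up.
toy_songs <- data.frame(
  danceability     = c(0.35, 0.50, 0.55, 0.62, 0.70, 0.78, 0.88, 0.80),
  track_popularity = c(12, 30, 41, 55, 63, 78, 90, 85)
)

# Scatter plot of popularity against danceability with a fitted line.
popularity_scatter <- ggplot(toy_songs,
                             aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "darkblue") +
  labs(title = "Danceability vs. Popularity",
       x = "Danceability",
       y = "Song Popularity")

print(popularity_scatter)
```

The low `alpha` value reduces overplotting, which matters on the full dataset of more than 30,000 songs.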

# Summary

Summary of Problem Statement

The analysis aimed to identify key factors contributing to the popularity of songs on Spotify.

Summary of Methodology

We employed exploratory data analysis (EDA) to examine the relationships between song attributes and popularity. The work shown here covers summary statistics, song popularity grouped by playlist genre, and a scatter plot of danceability against energy; the correlation and time series analyses described in the plan are candidates for follow-up work.

Summary of Insights

Insights from the analysis include understanding the correlations between song attributes and popularity, identifying patterns within specific groups (genre, artist), and tracking popularity trends over time.

Summary of Implications

Implications of this analysis are relevant for music artists, record labels, and music streaming platforms seeking to improve song recommendations and understand user preferences.

Discussion of Limitations

Limitations of this analysis include potential biases in the dataset, the absence of certain attributes that could influence popularity, and the dynamic nature of music preferences.

Conclusion and Final Remarks

Conclusion

In conclusion, this analysis provided valuable insights into the factors influencing song popularity on Spotify. The correlations, group-level patterns, and trends over time contribute to a comprehensive understanding of the dynamics in the music industry.

Final Remarks

Data analysis involves continuous learning, iteration, and adaptation of methods to the data. The analysis presented here serves as a foundation for further exploration, potentially incorporating advanced statistical models and visualization techniques in the future.

References

Spotify Songs dataset (provided in the course).

Documentation for the ggplot2, dplyr, tidyr, and readr R packages.

# Acknowledgements

Special thanks to the course instructor for providing guidance and valuable insights throughout the data analysis journey.

# Session Information

```r
sessionInfo()

## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readr_2.1.4   tidyr_1.3.0   dplyr_1.1.3   ggplot2_3.4.3
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.0.5         gtable_0.3.4      jsonlite_1.8.7    crayon_1.5.2     
##  [5] compiler_4.3.1    tidyselect_1.2.0  parallel_4.3.1    jquerylib_0.1.4  
##  [9] scales_1.2.1      yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
## [13] labeling_0.4.3    generics_0.1.3    knitr_1.44        tibble_3.2.1     
## [17] munsell_0.5.0     bslib_0.5.1       pillar_1.9.0      tzdb_0.4.0       
## [21] rlang_1.1.1       utf8_1.2.3        cachem_1.0.8      xfun_0.40        
## [25] sass_0.4.7        bit64_4.0.5       cli_3.6.1         withr_2.5.1      
## [29] magrittr_2.0.3    digest_0.6.33     grid_4.3.1        vroom_1.6.3      
## [33] rstudioapi_0.15.0 hms_1.1.3         lifecycle_1.0.3   vctrs_0.6.3      
## [37] evaluate_0.22     glue_1.6.2        farver_2.1.1      fansi_1.0.4      
## [41] colorspace_2.1-0  rmarkdown_2.25    purrr_1.0.2       tools_4.3.1      
## [45] pkgconfig_2.0.3   htmltools_0.5.6
```

Final Project Evaluation

2023-12-12