Project 2

Author

O Kandji

This analysis examines the characteristics of songs on Spotify and how their musical features relate to popularity. The main variables are song title, artist, duration, explicit-content status, release year, popularity, and audio features such as danceability, energy, and tempo. The data, sourced from Spotify, covers 2,000 songs released between 1998 and 2020 and helps uncover trends in popular music.

Streaming platforms have revolutionized the music industry by offering vast song libraries and, with them, valuable data. Analyzing this data can reveal listener preferences and the characteristics that popular songs share. I chose this topic because I am interested in the dynamics of the music industry and the factors that drive a song's success.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(ggrepel)

source: https://topesdegama.com/app/uploads-topesdegama.com/2018/08/Spotify-2.jpg

setwd("/Users/hunchoamaru/Desktop/data 110")
spotify_data <- read_csv('spotifysongs.csv')
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(spotify_data)
# A tibble: 6 × 18
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
1 Britney… Oops…      211160 FALSE     2000         77        0.751  0.834     1
2 blink-1… All …      167066 FALSE     1999         79        0.434  0.897     0
3 Faith H… Brea…      250546 FALSE     1999         66        0.529  0.496     7
4 Bon Jovi It's…      224493 FALSE     2000         78        0.551  0.913     0
5 *NSYNC   Bye …      200560 FALSE     2000         65        0.614  0.928     8
6 Sisqo    Thon…      253733 TRUE      1999         69        0.706  0.888     2
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>

Convert character columns to factors

spotify_data <- spotify_data %>%
  mutate(across(where(is.character), as.factor))

Summarize the data

summary(spotify_data)
            artist          song       duration_ms      explicit      
 Rihanna       :  25   Sorry  :   5   Min.   :113000   Mode :logical  
 Drake         :  23   Breathe:   3   1st Qu.:203580   FALSE:1449     
 Eminem        :  21   Closer :   3   Median :223280   TRUE :551      
 Calvin Harris :  20   Don't  :   3   Mean   :228748                  
 Britney Spears:  19   Faded  :   3   3rd Qu.:248133                  
 David Guetta  :  18   Higher :   3   Max.   :484146                  
 (Other)       :1874   (Other):1980                                   
      year        popularity     danceability        energy      
 Min.   :1998   Min.   : 0.00   Min.   :0.1290   Min.   :0.0549  
 1st Qu.:2004   1st Qu.:56.00   1st Qu.:0.5810   1st Qu.:0.6220  
 Median :2010   Median :65.50   Median :0.6760   Median :0.7360  
 Mean   :2009   Mean   :59.87   Mean   :0.6674   Mean   :0.7204  
 3rd Qu.:2015   3rd Qu.:73.00   3rd Qu.:0.7640   3rd Qu.:0.8390  
 Max.   :2020   Max.   :89.00   Max.   :0.9750   Max.   :0.9990  
                                                                 
      key            loudness            mode         speechiness     
 Min.   : 0.000   Min.   :-20.514   Min.   :0.0000   Min.   :0.02320  
 1st Qu.: 2.000   1st Qu.: -6.490   1st Qu.:0.0000   1st Qu.:0.03960  
 Median : 6.000   Median : -5.285   Median :1.0000   Median :0.05985  
 Mean   : 5.378   Mean   : -5.512   Mean   :0.5535   Mean   :0.10357  
 3rd Qu.: 8.000   3rd Qu.: -4.168   3rd Qu.:1.0000   3rd Qu.:0.12900  
 Max.   :11.000   Max.   : -0.276   Max.   :1.0000   Max.   :0.57600  
                                                                      
  acousticness       instrumentalness       liveness         valence      
 Min.   :0.0000192   Min.   :0.0000000   Min.   :0.0215   Min.   :0.0381  
 1st Qu.:0.0140000   1st Qu.:0.0000000   1st Qu.:0.0881   1st Qu.:0.3867  
 Median :0.0557000   Median :0.0000000   Median :0.1240   Median :0.5575  
 Mean   :0.1289549   Mean   :0.0152260   Mean   :0.1812   Mean   :0.5517  
 3rd Qu.:0.1762500   3rd Qu.:0.0000683   3rd Qu.:0.2410   3rd Qu.:0.7300  
 Max.   :0.9760000   Max.   :0.9850000   Max.   :0.8530   Max.   :0.9730  
                                                                          
     tempo                          genre    
 Min.   : 60.02   pop                  :428  
 1st Qu.: 98.99   hip hop, pop         :277  
 Median :120.02   hip hop, pop, R&B    :244  
 Mean   :120.12   pop, Dance/Electronic:221  
 3rd Qu.:134.27   pop, R&B             :178  
 Max.   :210.85   hip hop              :124  
                  (Other)              :528  

Filter data for songs released in the year 2000
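
No code follows this heading in the report, and the later steps use all 2,000 rows (the regression output reports 1997 residual degrees of freedom). The block below is a minimal sketch of what the filter could look like, not the author's code. To run without the CSV file, it uses the six rows printed by `head(spotify_data)`, reduced to four columns; `sample_songs` and `songs_2000` are hypothetical names.

```r
library(dplyr)

# The six rows shown by head(spotify_data), reduced to four columns
# so the sketch runs without spotifysongs.csv.
sample_songs <- data.frame(
  year         = c(2000, 1999, 1999, 2000, 2000, 1999),
  popularity   = c(77, 79, 66, 78, 65, 69),
  danceability = c(0.751, 0.434, 0.529, 0.551, 0.614, 0.706),
  energy       = c(0.834, 0.897, 0.496, 0.913, 0.928, 0.888)
)

# Keep only songs released in 2000 (songs_2000 is a hypothetical name).
songs_2000 <- sample_songs %>%
  filter(year == 2000)

# Number of rows that survive the filter
nrow(songs_2000)
```

On the full dataset, the same call would presumably be `spotify_data %>% filter(year == 2000)`.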

Summarize data to get average popularity by Genre

average_popularity_by_genre <- spotify_data %>%
  group_by(genre) %>%
  summarize(avg_popularity = mean(popularity, na.rm = TRUE))

Line plot of average song popularity by genre

ggplot(average_popularity_by_genre, aes(x = genre, y = avg_popularity, group = 1)) +
  geom_line(color = "red") + 
  theme_minimal() + # Apply a minimal theme
  labs(title = "Average Song Popularity by Genre", 
       x = "Genre", # Label the x-axis
       y = "Average Popularity") + # Label the y-axis
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        plot.margin = margin(1, 1, 1.5, 1.5, "cm"))

highchart() %>%
  hc_add_series(average_popularity_by_genre, "line", hcaes(x = genre, y = avg_popularity), name = "Average Popularity") %>%
  hc_title(text = "Average Song Popularity by Genre") %>%
  hc_xAxis(title = list(text = "Genre")) %>%
  hc_yAxis(title = list(text = "Average Popularity"))

Linear regression model

model <- lm(popularity ~ danceability + energy, data = spotify_data)
summary(model)

Call:
lm(formula = popularity ~ danceability + energy, data = spotify_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.228  -3.476   5.639  13.345  29.261 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   61.8494     3.4106  18.135   <2e-16 ***
danceability  -0.7687     3.4183  -0.225    0.822    
energy        -2.0320     3.1424  -0.647    0.518    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.34 on 1997 degrees of freedom
Multiple R-squared:  0.0002219, Adjusted R-squared:  -0.0007794 
F-statistic: 0.2216 on 2 and 1997 DF,  p-value: 0.8012
par(mfrow = c(1, 1))
plot(model)
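
As a reading aid for the regression printout, the sketch below recomputes the t values, two-sided p-values, and adjusted R-squared from the numbers shown in `summary(model)`. All inputs are copied from that output; the variable names are illustrative. The p-values of 0.822 and 0.518 mean neither danceability nor energy is a statistically significant predictor at conventional levels, and an R-squared of 0.0002219 means the model explains about 0.02% of the variance in popularity.

```r
# Values copied from the summary(model) printout.
estimate  <- c(intercept = 61.8494, danceability = -0.7687, energy = -2.0320)
std_error <- c(intercept = 3.4106,  danceability = 3.4183,  energy = 3.1424)

# t value = estimate / standard error
t_value <- estimate / std_error
round(t_value, 3)

# Two-sided p-value from the t distribution with 1997 residual degrees of freedom
p_value <- 2 * pt(abs(t_value), df = 1997, lower.tail = FALSE)
round(p_value, 3)

# Adjusted R-squared from R-squared, n = 2000 rows, and p = 2 predictors
r_squared <- 0.0002219
n <- 2000
p <- 2
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
adj_r_squared
```

The adjusted R-squared is negative because the penalty for two predictors outweighs the tiny share of variance they explain.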

Boxplot of popularity by genre

ggplot(spotify_data, aes(x = genre, y = popularity, fill = genre)) +
  geom_boxplot() +
  ggtitle("Popularity by Genre") +
  xlab("Genre") +
  ylab("Popularity") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10), 
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),  
    axis.title.x = element_text(size = 12, face = "bold"),  
    axis.title.y = element_text(size = 12, face = "bold"),  
    legend.position = "none", 
    plot.margin = margin(10, 10, 10, 10)  
  )