Project1

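# Load the tidyverse, which provides read_csv(), the dplyr verbs, %>%, and ggplot2
library(tidyverse)

# Set the working directory to the folder that contains spotifysongs.csv
setwd("/Users/alassanefaye/Library/Mobile Documents/com~apple~CloudDocs/DATA110 ")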

Introduction

This analysis explores the relationship between song duration (in seconds), danceability, energy, and popularity using a dataset of 2,000 Spotify songs. My goal is to determine whether shorter, more energetic, or more danceable songs tend to be more popular.

Shorter songs have become increasingly popular in the music industry in recent years. According to the Washington Post, the average song length on the Billboard Hot 100 has dropped from more than four minutes to about three minutes since 1990, regardless of genre. Social media apps like TikTok and streaming services like Spotify are largely responsible for this trend: shorter songs are more likely to be shared and replayed, reflecting listeners’ growing preference for immediacy and brevity.

At the same time, dance trends have grown in popularity, especially on platforms like TikTok, where choreographed moves to well-known songs frequently go viral. Based on my own observations, however, traditional social dancing at parties appears to be less common than it used to be, perhaps because of shifting entertainment tastes and social behaviors.

The purpose of this study is to determine whether shorter songs are more popular in today’s music industry. By looking at song length, danceability, and energy, and at how each relates to a track’s popularity, I hope to learn more about what influences a song’s success in the current musical landscape.

# Load the dataset
data <- read_csv("spotifysongs.csv")
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View dataset structure
glimpse(data)
Rows: 2,000
Columns: 18
$ artist           <chr> "Britney Spears", "blink-182", "Faith Hill", "Bon Jov…
$ song             <chr> "Oops!...I Did It Again", "All The Small Things", "Br…
$ duration_ms      <dbl> 211160, 167066, 250546, 224493, 200560, 253733, 28420…
$ explicit         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,…
$ year             <dbl> 2000, 1999, 1999, 2000, 2000, 1999, 2000, 2000, 1999,…
$ popularity       <dbl> 77, 79, 66, 78, 65, 69, 86, 68, 75, 77, 1, 56, 55, 62…
$ danceability     <dbl> 0.751, 0.434, 0.529, 0.551, 0.614, 0.706, 0.949, 0.70…
$ energy           <dbl> 0.834, 0.897, 0.496, 0.913, 0.928, 0.888, 0.661, 0.77…
$ key              <dbl> 1, 0, 7, 0, 8, 2, 5, 7, 5, 6, 7, 7, 11, 0, 3, 6, 10, …
$ loudness         <dbl> -5.444, -4.918, -9.007, -4.063, -4.806, -6.959, -4.24…
$ mode             <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,…
$ speechiness      <dbl> 0.0437, 0.0488, 0.0290, 0.0466, 0.0516, 0.0654, 0.057…
$ acousticness     <dbl> 0.30000, 0.01030, 0.17300, 0.02630, 0.04080, 0.11900,…
$ instrumentalness <dbl> 1.77e-05, 0.00e+00, 0.00e+00, 1.35e-05, 1.04e-03, 9.6…
$ liveness         <dbl> 0.3550, 0.6120, 0.2510, 0.3470, 0.0845, 0.0700, 0.045…
$ valence          <dbl> 0.8940, 0.6840, 0.2780, 0.5440, 0.8790, 0.7140, 0.760…
$ tempo            <dbl> 95.053, 148.726, 136.859, 119.992, 172.656, 121.549, …
$ genre            <chr> "pop", "rock, pop", "pop, country", "rock, metal", "p…
# Select relevant variables and convert duration to seconds
data <- data %>% 
  select(duration_ms, popularity, danceability, energy) %>% 
  mutate(duration_sec = duration_ms / 1000)

# Check for missing values
sum(is.na(data))
[1] 0
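The missing-value check above is one of several cleaning checks described in the conclusion. As a minimal sketch, the duplicate-row and 0-to-1 range checks could look like the following. It uses base R only and, so that it runs on its own, a small data frame called `sample_rows` (an illustrative name, not part of the original analysis) built from the first six rows shown in the `glimpse()` output. On the real data the same calls would be applied to `data`.

```r
# First six rows from the glimpse() output above, used only so the sketch runs on its own.
# `sample_rows` is an illustrative name, not part of the original analysis.
sample_rows <- data.frame(
  duration_ms  = c(211160, 167066, 250546, 224493, 200560, 253733),
  popularity   = c(77, 79, 66, 78, 65, 69),
  danceability = c(0.751, 0.434, 0.529, 0.551, 0.614, 0.706),
  energy       = c(0.834, 0.897, 0.496, 0.913, 0.928, 0.888)
)

# Count fully duplicated rows (0 means every row is unique)
n_duplicates <- sum(duplicated(sample_rows))

# Helper: TRUE when every value lies within the 0-to-1 scale
in_unit_range <- function(x) all(x >= 0 & x <= 1)
dance_ok  <- in_unit_range(sample_rows$danceability)
energy_ok <- in_unit_range(sample_rows$energy)

print(c(duplicates = n_duplicates, dance_ok = dance_ok, energy_ok = energy_ok))
```

If duplicates matter for the analysis, it is probably safer to run the duplicate check before `select()`, since two different songs could share the same four numeric values once the artist and song columns are dropped.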
# Run a multiple linear regression model
model <- lm(popularity ~ duration_sec + danceability + energy, data = data)

# Display model summary
summary(model)

Call:
lm(formula = popularity ~ duration_sec + danceability + energy, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.436  -3.741   5.630  13.489  28.922 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.87264    4.64330  11.818   <2e-16 ***
duration_sec  0.02710    0.01225   2.212   0.0271 *  
danceability -0.24757    3.42311  -0.072   0.9424    
energy       -1.43528    3.15092  -0.456   0.6488    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.32 on 1996 degrees of freedom
Multiple R-squared:  0.002666,  Adjusted R-squared:  0.001167 
F-statistic: 1.779 on 3 and 1996 DF,  p-value: 0.1492
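To read the duration coefficient in practical terms, the figures printed in the summary can be plugged into a short calculation. This is a sketch that copies the estimate, standard error, and R-squared from the output above instead of refitting the model; the variable names are illustrative.

```r
# Values copied from the summary(model) output above
est_duration <- 0.02710   # popularity points per additional second
se_duration  <- 0.01225   # standard error of that estimate
df_resid     <- 1996      # residual degrees of freedom
r_squared    <- 0.002666  # multiple R-squared

# Predicted change in popularity for a song one minute longer
effect_per_minute <- est_duration * 60
print(effect_per_minute)  # 1.626

# Approximate 95% confidence interval for the per-second effect
t_crit <- qt(0.975, df_resid)
ci <- est_duration + c(-1, 1) * t_crit * se_duration
print(round(ci, 4))

# Share of the variance in popularity explained by the model, as a percentage
print(round(100 * r_squared, 2))  # 0.27
```

With the other predictors held fixed, a song one minute longer is predicted to score about 1.6 points higher on the 0-100 scale, which is small next to the residual standard error of 21.32.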

Visualization

ggplot(data, aes(x = duration_sec, y = popularity, color = energy)) +   
  geom_point(aes(size = danceability), alpha = 0.7) +    
  geom_smooth(method = "lm", color = "white") +     
  labs(     
    title = "Relationship Between Song Features and Popularity",     
    subtitle = "Analyzing the Impact of Duration, Danceability, and Energy on Spotify Popularity",     
    x = "Song Duration (seconds)",     
    y = "Popularity (0-100 scale)",     
    color = "Energy Level",     
    size = "Danceability",     
    caption = "Data Source: Spotify"   
  ) +   
  scale_color_gradient(low = "yellow", high = "hotpink") + 
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Conclusion

The multiple linear regression shows that song popularity is only weakly related to duration, danceability, and energy. The model explains less than 1% of the variation in popularity (R-squared = 0.0027), and the overall F-test is not significant (p = 0.149). Duration is the only predictor that is significant at the 5% level (p = 0.027), and its coefficient is positive, so in this sample longer songs are, if anything, slightly more popular. This suggests that other factors play a more significant role in determining a song’s popularity.

To ensure reliability, I cleaned the dataset before starting the analysis. For easier interpretation, I converted song duration from milliseconds to seconds, and I checked for missing values and found none. I also confirmed that danceability and energy were already on a common 0-to-1 scale and needed no further transformation, and that there were no duplicate rows.

The visual uses energy (color) and danceability (point size) to examine the connection between song length and popularity. Despite industry trends, the regression line shows only a weak association, indicating that shorter songs are not always more popular. To my surprise, given the prevalence of social media dance trends, danceability does not show a strong relationship with popularity, and energy levels appear evenly spread across the plot.

One limitation is that I did not investigate interactions between predictors, such as the joint influence of energy and danceability on popularity. It would also be informative to examine genre differences or whether shorter songs have become more popular recently. Even though song lengths have shrunk, this analysis does not provide compelling evidence that shorter songs are more popular. A closer examination of genre and time trends may give a more thorough understanding of current music trends. Overall, I found this analysis interesting and informative.
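The joint influence of energy and danceability mentioned among the limitations could be explored with a model along these lines. This is only a sketch: `fit_interaction()` is a hypothetical helper, and the six rows taken from the `glimpse()` output are used purely to show that the code runs. A fit on six rows means nothing statistically; on the full dataset the helper would be called on `data`.

```r
# Hypothetical helper: same predictors as the original model, plus an interaction term.
# In an R formula, danceability * energy expands to both main effects and their product.
fit_interaction <- function(df) {
  lm(popularity ~ duration_sec + danceability * energy, data = df)
}

# First six rows from the glimpse() output, with duration already converted to seconds.
# Used only to demonstrate the call; not a meaningful sample.
sample_rows <- data.frame(
  duration_sec = c(211.160, 167.066, 250.546, 224.493, 200.560, 253.733),
  popularity   = c(77, 79, 66, 78, 65, 69),
  danceability = c(0.751, 0.434, 0.529, 0.551, 0.614, 0.706),
  energy       = c(0.834, 0.897, 0.496, 0.913, 0.928, 0.888)
)

fit <- fit_interaction(sample_rows)

# The coefficient named "danceability:energy" is the interaction term
print(names(coef(fit)))
```

On the full data, `summary(fit_interaction(data))` would report whether the `danceability:energy` coefficient differs from zero, which is one way to test whether the effect of danceability depends on energy.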
