Spotify Songs Project 1

Author

Oliver Kronen

Introduction

The data set I have chosen to work with is the Spotify songs CSV, which comes from Spotify. It contains both categorical and quantitative variables, such as artist names, genre, and the duration of each song in milliseconds. For this project, I plan to explore the influence that song characteristics have on the popularity of a song. I will test the following variables: danceability, explicit (use of profanity), energy, tempo, duration in milliseconds, and loudness.

This chunk loads all of the packages used in this project.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
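library(ggplot2) # already attached by tidyverse; listed for clarity
library(dplyr) # already attached by tidyverse; listed for clarity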

Here, I set my working directory to my class folder. Then I load the data set.

setwd("C:/Users/MyPC/Downloads/Data 110") # setting the working directory
spotify <- read_csv("spotifysongs.csv") # Loading the data set
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Following The Epidemiologist R Handbook, I check for any NA values in the data set. I found the specific code I needed in section 20.3, Useful functions.

sum(is.na(spotify)) # Checking for any NA values
[1] 0

The function returned 0, so there are no NA values in the data set and no missing data to clean.
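The single total above says how many values are missing, but not where. As a minimal sketch, the same check can be broken down by column with `colSums()`. The `toy` data frame below is made up purely for illustration and stands in for `spotify`.

```r
# Toy data frame with made-up values, used only to illustrate the check
toy <- data.frame(
  artist = c("A", "B", NA),
  popularity = c(70, NA, 55),
  explicit = c(TRUE, FALSE, TRUE)
)

total_missing <- sum(is.na(toy))          # one count for the whole table
missing_by_column <- colSums(is.na(toy))  # one count per column

total_missing
missing_by_column
```

With the real data, `colSums(is.na(spotify))` would show which columns need attention if the total were ever above 0.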

This chunk performs the first linear regression analysis.

main_model <- lm(popularity ~ danceability + explicit + energy + tempo + duration_ms + loudness, data = spotify) # Fit a linear regression and store the result in main_model
summary(main_model) # Display a summary of the fitted model

Call:
lm(formula = popularity ~ danceability + explicit + energy + 
    tempo + duration_ms + loudness, data = spotify)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.879  -3.738   5.475  13.222  28.509 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.293e+01  6.374e+00   9.873   <2e-16 ***
danceability -1.746e+00  3.585e+00  -0.487   0.6263    
explicitTRUE  1.965e+00  1.125e+00   1.746   0.0809 .  
energy       -7.504e+00  4.193e+00  -1.790   0.0736 .  
tempo         1.233e-02  1.818e-02   0.678   0.4977    
duration_ms   2.549e-05  1.236e-05   2.062   0.0393 *  
loudness      7.879e-01  3.251e-01   2.424   0.0154 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.29 on 1993 degrees of freedom
Multiple R-squared:  0.007489,  Adjusted R-squared:  0.004501 
F-statistic: 2.506 on 6 and 1993 DF,  p-value: 0.02028

The adjusted R-squared is 0.004501, or about 0.4501%. The danceability variable has the highest p-value (0.6263), so we will remove it and perform the test again.
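The adjusted R-squared and the p-values can also be read from the summary object instead of by eye. The sketch below uses R's built-in `mtcars` data as a stand-in so that it runs on its own; the model `fit` and its predictors are illustrative and are not the Spotify model.

```r
# Stand-in model on the built-in mtcars data (not the Spotify data)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_summary <- summary(fit)

adj_r2 <- fit_summary$adj.r.squared           # adjusted R-squared as a proportion
p_values <- coef(fit_summary)[, "Pr(>|t|)"]   # one p-value per coefficient

# Ignore the intercept, then find the predictor with the highest p-value
predictor_p <- p_values[names(p_values) != "(Intercept)"]
weakest <- names(which.max(predictor_p))

adj_r2
weakest

# Converting a proportion to a percentage, using the value reported above
reported_percent <- sprintf("%.4f%%", 100 * 0.004501)
reported_percent
```

Applied to `main_model`, the same steps would return the adjusted R-squared and name the candidate for removal.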

model2 <- lm(popularity ~ explicit + energy + tempo + duration_ms + loudness, data = spotify) # Danceability had the highest p-value, so I removed it and refit the model
summary(model2) # Display the results

Call:
lm(formula = popularity ~ explicit + energy + tempo + duration_ms + 
    loudness, data = spotify)

Residuals:
    Min      1Q  Median      3Q     Max 
-65.130  -3.772   5.487  13.238  28.644 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.137e+01  5.501e+00  11.156   <2e-16 ***
explicitTRUE  1.825e+00  1.088e+00   1.678   0.0935 .  
energy       -7.390e+00  4.186e+00  -1.766   0.0776 .  
tempo         1.389e-02  1.789e-02   0.777   0.4374    
duration_ms   2.611e-05  1.229e-05   2.124   0.0338 *  
loudness      7.826e-01  3.248e-01   2.409   0.0161 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.28 on 1994 degrees of freedom
Multiple R-squared:  0.007371,  Adjusted R-squared:  0.004882 
F-statistic: 2.961 on 5 and 1994 DF,  p-value: 0.01142

The adjusted R-squared has increased from 0.4501% to 0.4882%. The tempo variable now has the highest p-value (0.4374), so we will remove it and try again.
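Each step in this process drops one term and refits. As a sketch of a shorter way to write that, base R's `update()` can modify an existing model's formula instead of retyping it. The example uses the built-in `mtcars` data as a stand-in, and the names `full_fit` and `reduced_fit` are illustrative.

```r
# Stand-in model on the built-in mtcars data (not the Spotify data)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# ". ~ . - drat" means: same response, same predictors, minus drat
reduced_fit <- update(full_fit, . ~ . - drat)

formula(reduced_fit)

# Compare adjusted R-squared before and after dropping the term
c(
  full = summary(full_fit)$adj.r.squared,
  reduced = summary(reduced_fit)$adj.r.squared
)
```

With the models in this report, `update(model2, . ~ . - tempo)` would produce the same fit as writing out the formula without tempo.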

model3 <- lm(popularity ~ explicit + energy + duration_ms + loudness, data = spotify) # Tempo had the next highest p-value, so I removed it and refit the model
summary(model3) # Display the results

Call:
lm(formula = popularity ~ explicit + energy + duration_ms + loudness, 
    data = spotify)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.599  -3.672   5.505  13.236  28.656 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.271e+01  5.219e+00  12.016   <2e-16 ***
explicitTRUE  1.861e+00  1.087e+00   1.712   0.0870 .  
energy       -6.943e+00  4.145e+00  -1.675   0.0941 .  
duration_ms   2.590e-05  1.229e-05   2.107   0.0352 *  
loudness      7.756e-01  3.247e-01   2.389   0.0170 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.28 on 1995 degrees of freedom
Multiple R-squared:  0.007071,  Adjusted R-squared:  0.00508 
F-statistic: 3.552 on 4 and 1995 DF,  p-value: 0.006791

The adjusted R-squared increased again, from 0.4882% to 0.508%. Energy now has the highest p-value (0.0941), so we remove it and run the test again.

model4 <- lm(popularity ~ explicit + duration_ms + loudness, data = spotify) # Energy had the next highest p-value, so I removed it and refit the model
summary(model4) # Display the results

Call:
lm(formula = popularity ~ explicit + duration_ms + loudness, 
    data = spotify)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.543  -3.728   5.480  13.267  29.409 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.562e+01  3.048e+00  18.246   <2e-16 ***
explicitTRUE 2.105e+00  1.077e+00   1.954   0.0508 .  
duration_ms  2.630e-05  1.229e-05   2.140   0.0325 *  
loudness     4.243e-01  2.479e-01   1.712   0.0871 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.29 on 1996 degrees of freedom
Multiple R-squared:  0.005675,  Adjusted R-squared:  0.00418 
F-statistic: 3.797 on 3 and 1996 DF,  p-value: 0.009923

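The adjusted R-squared has decreased from 0.508% to 0.418%, so removing energy made the model fit worse. I therefore keep model3, which contains explicit, energy, duration_ms, and loudness, as the final model. Within it, duration_ms and loudness are significant at the 0.05 level, while explicit and energy are significant only at the 0.1 level, so the evidence that these four variables influence popularity is modest.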
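Comparing adjusted R-squared values is one way to choose between two nested models. Another is an F-test with `anova()`, sketched below on the built-in `mtcars` data as a stand-in. The names `larger`, `smaller`, and `p_drop` are illustrative only.

```r
# Stand-in nested models on the built-in mtcars data (not the Spotify data)
larger <- lm(mpg ~ wt + hp + qsec, data = mtcars)
smaller <- update(larger, . ~ . - qsec)

# F-test: does the larger model fit significantly better than the smaller one?
comparison <- anova(smaller, larger)
comparison

# p-value for the dropped term, taken from the second row of the table
p_drop <- comparison[["Pr(>F)"]][2]
p_drop
```

When only one term is dropped, this F-test gives the same p-value as the t-test for that coefficient in the larger model. So `anova(model4, model3)` would report the p-value already shown for energy in the model3 summary.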

Now I will make a graph showing the popularity of songs by Post Malone, Linkin Park, Red Hot Chili Peppers, and Arctic Monkeys based on loudness and whether each song is explicit.

four_artists <- spotify |> # Create a new data frame from the data set
  filter(artist %in% c("Post Malone", "Linkin Park", "Red Hot Chili Peppers", "Arctic Monkeys")) # Keep only the songs by these four artists

graph <- ggplot(four_artists, aes(x = loudness, y = popularity, color = artist)) + # Use the filtered data; set the x, y, and color aesthetics
  geom_point(aes(shape = explicit), size = 5, alpha = 0.7) + # Scatter plot; shape shows explicit status, with a fixed size and transparency
  scale_color_manual(values = c("cyan", "purple", "gold", "red")) + # Custom point colours, assigned to the artists in alphabetical order
  theme_minimal() + # White background instead of grey
  labs(
    x = "Loudness in Decibels",
    y = "Popularity",
    title = "Popularity of Different Artists Based on Explicits and Loudness in Decibels",
    shape = "Use of Explicits",
    color = "Artists",
    caption = "Spotify"
  ) # Label the axes, title, legends, and caption
graph # Print the plot

Now I will write the second brief essay below.

Filtering the data down to four artists for the graph was the only alteration I made to the data set. At the beginning of the document, I checked for NA values by taking the sum of is.na. The result was 0, so I did not have anything to clean up. I filtered the data because there were simply too many artists in the data set to plot at once, so it was in my best interest to narrow them down. I picked Post Malone because, of the many options, they were the only artist I really listened to. I picked the other three simply because I liked their names.

The visualization shows the relationship between the popularity of an artist’s song, the loudness of the song, and whether the song contains explicit language or material. I believe I chose too many popular artists, because across all four artists’ songs, only one falls below 60 in popularity. Regarding trends in the graph, Post Malone appears to be the most consistent in popularity while being the least consistent in loudness. It is also a little surprising that they have both the most popular and the least popular song of all four artists. Linkin Park and the Red Hot Chili Peppers tend to remain consistent in both how loud their music is and how popular it is. Overall, among these four artists, songs appear to be less popular when their loudness is around -5 to -3 dB. I expected the use of explicit language or material to have a greater impact on a song’s popularity, but I do not see any trend between the two.

One thing I wanted to include in the graph was a bit of interactivity. Specifically, I wanted hovering over a point to show which song it represents. However, I could not figure out how to get the code to work. I also wish I could have included many more artists, but the plot becomes too cluttered and difficult to interpret. Finally, the adjusted R-squared value was extremely low in every linear regression model, never exceeding about 0.5%. This tells us that while some of the factors in the model were statistically significant, factors not investigated here play a much bigger role in a song’s popularity.
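One possible way to get the hover behaviour described above is the `plotly` package, which can convert a ggplot into an interactive plot. This is only a sketch under assumptions: it requires `plotly` to be installed, and it uses the built-in `mtcars` data as a stand-in, with the hypothetical `car_data$model` column playing the role that the `song` column would play for the Spotify data.

```r
library(ggplot2)
library(plotly) # assumption: the plotly package is installed

# Stand-in data: built-in mtcars, with the row names stored as a label column
car_data <- mtcars
car_data$model <- rownames(mtcars)

# The extra "text" aesthetic holds the label to show on hover
static_plot <- ggplot(car_data, aes(x = wt, y = mpg, color = factor(cyl), text = model)) +
  geom_point(size = 3, alpha = 0.7) +
  theme_minimal()

# tooltip = "text" limits the hover box to the label mapped above
interactive_plot <- ggplotly(static_plot, tooltip = "text")
interactive_plot
```

For the four-artist graph, the equivalent change would be adding `text = song` to the `aes()` call and passing the finished plot to `ggplotly()`.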