The data set I have chosen to work with is the Spotify songs CSV, and its source is Spotify. It contains both categorical and quantitative variables, such as artist names, the duration of each song in milliseconds, and genre. For this project, I plan to explore the influence that song characteristics have on the popularity of a song. I will test the influence of the following variables: danceability, explicit (use of profanity), energy, tempo, duration in milliseconds, and loudness.
This chunk loads all of the necessary packages that will be used in this project.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
Here, I set my working directory to my class folder. Then I load the data set.
setwd("C:/Users/MyPC/Downloads/Data 110") # setting the working directory
spotify <- read_csv("spotifysongs.csv") # Loading the data set
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl (1): explicit
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Using the Epidemiologist R Handbook website, I check for any NA values in the data set.
I found the specific code I needed in section 20.3, Useful functions.
sum(is.na(spotify)) # Checking for any NA values
[1] 0
The function returned the value 0. Hence, there are no NA values in the data set and no need for cleaning.
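A single total cannot say which column a missing value would come from. As a minimal sketch, assuming a small made-up data frame in place of the Spotify data (the name `toy` and its values are hypothetical placeholders), base R's `colSums()` applied to `is.na()` gives one count per column:

```r
# Hypothetical stand-in data; these values are placeholders, not from the Spotify file
toy <- data.frame(
  popularity = c(70, NA, 55),
  loudness   = c(-5.2, -4.1, NA),
  explicit   = c(TRUE, FALSE, TRUE)
)

total_na  <- sum(is.na(toy))      # one number for the whole data frame
na_by_col <- colSums(is.na(toy))  # named vector: missing values per column

print(total_na)
print(na_by_col)
```

On the real data, the same two calls with `spotify` in place of `toy` would show whether any single column needs cleaning.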
This chunk performs the first linear regression analysis.
main_model <- lm(popularity ~ danceability + explicit + energy + tempo + duration_ms + loudness, data = spotify) # Performing a linear regression analysis and setting the results to the variable main_model
summary(main_model) # Summarizing the results, this displays them.
Call:
lm(formula = popularity ~ danceability + explicit + energy +
tempo + duration_ms + loudness, data = spotify)
Residuals:
Min 1Q Median 3Q Max
-64.879 -3.738 5.475 13.222 28.509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.293e+01 6.374e+00 9.873 <2e-16 ***
danceability -1.746e+00 3.585e+00 -0.487 0.6263
explicitTRUE 1.965e+00 1.125e+00 1.746 0.0809 .
energy -7.504e+00 4.193e+00 -1.790 0.0736 .
tempo 1.233e-02 1.818e-02 0.678 0.4977
duration_ms 2.549e-05 1.236e-05 2.062 0.0393 *
loudness 7.879e-01 3.251e-01 2.424 0.0154 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.29 on 1993 degrees of freedom
Multiple R-squared: 0.007489, Adjusted R-squared: 0.004501
F-statistic: 2.506 on 6 and 1993 DF, p-value: 0.02028
The adjusted R squared of this model is 0.4501%. The danceability variable has the highest p value (0.6263), so we will remove it and perform the test again.
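For reference, the adjusted R squared can be recomputed from the multiple R squared, the number of rows (2000), and the number of predictors (6) printed in the summary. This is a small sketch using the standard formula; the variable names are my own, and because the inputs are already rounded, the result only agrees with the summary to the printed digits.

```r
r2 <- 0.007489  # Multiple R-squared from summary(main_model)
n  <- 2000      # rows in the data set
p  <- 6         # predictors in main_model (the intercept is not counted)

# Adjusted R squared penalises each extra predictor:
# n - p - 1 equals the 1993 residual degrees of freedom in the summary
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 6))        # as a proportion, as printed by summary()
print(round(100 * adj_r2, 4))  # the same value as a percentage
```

The penalty term is why dropping a predictor that adds little can raise the adjusted value even though the multiple R squared goes down.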
model2 <- lm(popularity ~ explicit + energy + tempo + duration_ms + loudness, data = spotify) # Danceability had the highest p value, I removed it and did another analysis
summary(model2) # Displaying the results
Call:
lm(formula = popularity ~ explicit + energy + tempo + duration_ms +
loudness, data = spotify)
Residuals:
Min 1Q Median 3Q Max
-65.130 -3.772 5.487 13.238 28.644
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.137e+01 5.501e+00 11.156 <2e-16 ***
explicitTRUE 1.825e+00 1.088e+00 1.678 0.0935 .
energy -7.390e+00 4.186e+00 -1.766 0.0776 .
tempo 1.389e-02 1.789e-02 0.777 0.4374
duration_ms 2.611e-05 1.229e-05 2.124 0.0338 *
loudness 7.826e-01 3.248e-01 2.409 0.0161 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.28 on 1994 degrees of freedom
Multiple R-squared: 0.007371, Adjusted R-squared: 0.004882
F-statistic: 2.961 on 5 and 1994 DF, p-value: 0.01142
The adjusted R squared has increased from 0.4501% to 0.4882%. The tempo variable now has the highest p value (0.4374), so we will remove it and try again.
model3 <- lm(popularity ~ explicit + energy + duration_ms + loudness, data = spotify) # Tempo had the next highest p value, I removed it and did another analysis
summary(model3) # Displaying the results
Call:
lm(formula = popularity ~ explicit + energy + duration_ms + loudness,
data = spotify)
Residuals:
Min 1Q Median 3Q Max
-64.599 -3.672 5.505 13.236 28.656
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.271e+01 5.219e+00 12.016 <2e-16 ***
explicitTRUE 1.861e+00 1.087e+00 1.712 0.0870 .
energy -6.943e+00 4.145e+00 -1.675 0.0941 .
duration_ms 2.590e-05 1.229e-05 2.107 0.0352 *
loudness 7.756e-01 3.247e-01 2.389 0.0170 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.28 on 1995 degrees of freedom
Multiple R-squared: 0.007071, Adjusted R-squared: 0.00508
F-statistic: 3.552 on 4 and 1995 DF, p-value: 0.006791
The adjusted R squared increased again, from 0.4882% to 0.508%. Energy now has the highest p value (0.0941), so we remove it and run the test again.
model4 <- lm(popularity ~ explicit + duration_ms + loudness, data = spotify) # Energy had the next highest p value, removed it and did another analysis
summary(model4) # Display the results
Call:
lm(formula = popularity ~ explicit + duration_ms + loudness,
data = spotify)
Residuals:
Min 1Q Median 3Q Max
-63.543 -3.728 5.480 13.267 29.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.562e+01 3.048e+00 18.246 <2e-16 ***
explicitTRUE 2.105e+00 1.077e+00 1.954 0.0508 .
duration_ms 2.630e-05 1.229e-05 2.140 0.0325 *
loudness 4.243e-01 2.479e-01 1.712 0.0871 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.29 on 1996 degrees of freedom
Multiple R-squared: 0.005675, Adjusted R-squared: 0.00418
F-statistic: 3.797 on 3 and 1996 DF, p-value: 0.009923
The adjusted R squared has decreased from 0.508% to 0.418%. Because removing energy made the fit worse, energy appears to contribute to explaining the popularity of a song, so model3 is the model I keep. From this, I conclude that the variables explicit, energy, duration_ms, and loudness influence popularity, although in model3 only duration_ms and loudness are significant at the 0.05 level.
Now I will make a graph showing the popularity of songs by Post Malone, Linkin Park, Red Hot Chili Peppers, and Arctic Monkeys based on loudness and whether or not each song is explicit.
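The comparison of adjusted R squared values can also be collected in one table instead of being read off separate summaries. This is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for the Spotify file; the two models and the helper name `adj_r2` are illustrative and are not the models fitted in this document.

```r
# Stand-in data: mtcars ships with R; it is not the Spotify data set
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)  # same model with qsec dropped

# Hypothetical helper: pull the adjusted R squared out of a fitted lm
adj_r2 <- function(model) summary(model)$adj.r.squared

comparison <- data.frame(
  model  = c("full", "reduced"),
  adj_r2 = c(adj_r2(full), adj_r2(reduced))
)
print(comparison)

# anova() on nested models gives an F test for the dropped term
print(anova(reduced, full))
```

With the Spotify models, passing `main_model`, `model2`, `model3`, and `model4` to the same helper would put the four values side by side.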
four_artists <- spotify |> # Creating a new variable using the data set
  filter(artist %in% c("Post Malone", "Linkin Park", "Red Hot Chili Peppers", "Arctic Monkeys")) # Filtering all the data to only these four artists
graph <- ggplot(four_artists, aes(x = loudness, y = popularity, color = artist)) + # Create a new graph variable. Use the filtered artist data. Set the x, y, and color aesthetics
  geom_point(aes(shape = explicit), size = 5, alpha = 0.7) + # Make a scatter plot. Define the shape of the points, their size and their transparency
  scale_color_manual(values = c("cyan", "purple", "gold", "red")) + # Changing the colours of the points on the scatter plot
  theme_minimal() + # Setting the background to white instead of grey
  labs(x = "Loudness in Decibels", y = "Popularity", title = "Popularity of Different Artists Based on Explicits and Loudness in Decibels", shape = "Use of Explicits", color = "Artists", caption = "Spotify") # Label everything
graph # Displaying the graph
Now I will write the second brief essay below.
Filtering the data to only four artists for the graph is the only alteration I made to the data set. At the beginning of the document, I checked whether there were any NA values by finding the sum of is.na. The result was 0 NA values, so I did not have anything to clean up. As for the filtering, there were simply too many artists in the data set, so it was in my best interest to narrow them down. Of the many options, Post Malone was the only artist I really listened to. I picked the rest simply because I liked their names.
The visualization represents the relationship between the popularity of an artist's song, the loudness of the song, and whether the song uses explicit language or material. I believe I chose too many popular artists because, across all of the songs by the four artists, only one falls below 60 in popularity. Regarding trends in the graph, it appears that Post Malone is the most consistent in popularity while being the most inconsistent in the loudness of their music. It is also a little surprising that they have both the most popular and the least popular song of all four artists. Linkin Park and the Red Hot Chili Peppers tend to remain consistent in how loud their music is and how popular they are. Overall, it would appear that songs are less popular when they have a loudness of around -5 to -3 decibels. I thought the use of explicit language or material in a song would have a greater impact on its popularity, but I do not see any trend between the two.
One thing I wanted to include in the graph was a bit of interactivity. Specifically, I wanted to make it so that when you hover over one of the points, you can see which song it is. However, I could not figure out how to get the code to work. I also wish I could have included many more artists, but that becomes an issue because the points cluster together and get difficult to interpret. One other thing I want to at least mention is that the adjusted R squared value was extremely low in all of the linear regression models. This tells us that while the factors used in the model had some significance for the popularity of a song, outside factors that were not investigated play a bigger role in popularity.
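One possible way to get the hover behaviour is the plotly package, which can convert a ggplot into an interactive plot with `ggplotly()`. This is only a sketch under assumptions: it assumes plotly is installed (it is not among the packages loaded in this document), and the data frame `demo`, its song names, and its numbers are made-up placeholders rather than rows from the Spotify file.

```r
library(ggplot2)

# Placeholder rows; song names and numbers are made up for illustration
demo <- data.frame(
  song       = c("Song A", "Song B", "Song C"),
  artist     = c("Artist 1", "Artist 1", "Artist 2"),
  loudness   = c(-6.0, -4.5, -3.2),
  popularity = c(72, 65, 80)
)

# The text aesthetic is not drawn by ggplot2; plotly reads it for the tooltip
p <- ggplot(demo, aes(x = loudness, y = popularity, color = artist, text = song)) +
  geom_point(size = 5, alpha = 0.7) +
  theme_minimal()

# Assumes the plotly package is installed; the conversion is skipped otherwise
if (requireNamespace("plotly", quietly = TRUE)) {
  plotly::ggplotly(p, tooltip = "text")
}
```

With the real data, the same idea would be to add `text = song` to the aesthetics of the existing graph and pass that graph to `ggplotly()`.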
This is the link to the source I used. https://www.epirhandbook.com/en/new_pages/missing_data.html