Project 2 Emilio S

Author

Emilio Sanchez San Martin

knitr::include_graphics("https://download.logo.wine/logo/Spotify/Spotify-Logo.wine.png")

knitr::opts_chunk$set(echo = TRUE)

Introduction

Spotify is one of the world’s most famous music streaming services and has been around since 2006. It has a vast library of songs and artists, and it is used by millions of people around the world. In this project, we will investigate a Spotify data set to answer some questions about specific songs, artists, and the platform itself. We will look at the most popular songs, artists, and genres, and we will also explore the relationship between the popularity of a song and its characteristics. This data set was web scraped, and it contains information on songs released from 1998 to 2020.

Load all the necessary libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(readr)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(ggfortify)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Load the data

setwd("~/Downloads/Data 101 and Data 110 class/Data 110")
spotifysongs <- read_csv("Data Sets/spotifysongs.csv")
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Exploration

head(spotifysongs)
# A tibble: 6 × 18
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
1 Britney… Oops…      211160 FALSE     2000         77        0.751  0.834     1
2 blink-1… All …      167066 FALSE     1999         79        0.434  0.897     0
3 Faith H… Brea…      250546 FALSE     1999         66        0.529  0.496     7
4 Bon Jovi It's…      224493 FALSE     2000         78        0.551  0.913     0
5 *NSYNC   Bye …      200560 FALSE     2000         65        0.614  0.928     8
6 Sisqo    Thon…      253733 TRUE      1999         69        0.706  0.888     2
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>

Here are the variables:

str(spotifysongs)
spc_tbl_ [2,000 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ artist          : chr [1:2000] "Britney Spears" "blink-182" "Faith Hill" "Bon Jovi" ...
 $ song            : chr [1:2000] "Oops!...I Did It Again" "All The Small Things" "Breathe" "It's My Life" ...
 $ duration_ms     : num [1:2000] 211160 167066 250546 224493 200560 ...
 $ explicit        : logi [1:2000] FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ year            : num [1:2000] 2000 1999 1999 2000 2000 ...
 $ popularity      : num [1:2000] 77 79 66 78 65 69 86 68 75 77 ...
 $ danceability    : num [1:2000] 0.751 0.434 0.529 0.551 0.614 0.706 0.949 0.708 0.713 0.72 ...
 $ energy          : num [1:2000] 0.834 0.897 0.496 0.913 0.928 0.888 0.661 0.772 0.678 0.808 ...
 $ key             : num [1:2000] 1 0 7 0 8 2 5 7 5 6 ...
 $ loudness        : num [1:2000] -5.44 -4.92 -9.01 -4.06 -4.81 ...
 $ mode            : num [1:2000] 0 1 1 0 0 1 0 1 0 1 ...
 $ speechiness     : num [1:2000] 0.0437 0.0488 0.029 0.0466 0.0516 0.0654 0.0572 0.0322 0.102 0.0379 ...
 $ acousticness    : num [1:2000] 0.3 0.0103 0.173 0.0263 0.0408 0.119 0.0302 0.0267 0.273 0.00793 ...
 $ instrumentalness: num [1:2000] 1.77e-05 0.00 0.00 1.35e-05 1.04e-03 9.64e-05 0.00 0.00 0.00 2.93e-02 ...
 $ liveness        : num [1:2000] 0.355 0.612 0.251 0.347 0.0845 0.07 0.0454 0.467 0.149 0.0634 ...
 $ valence         : num [1:2000] 0.894 0.684 0.278 0.544 0.879 0.714 0.76 0.861 0.734 0.869 ...
 $ tempo           : num [1:2000] 95.1 148.7 136.9 120 172.7 ...
 $ genre           : chr [1:2000] "pop" "rock, pop" "pop, country" "rock, metal" ...
 - attr(*, "spec")=
  .. cols(
  ..   artist = col_character(),
  ..   song = col_character(),
  ..   duration_ms = col_double(),
  ..   explicit = col_logical(),
  ..   year = col_double(),
  ..   popularity = col_double(),
  ..   danceability = col_double(),
  ..   energy = col_double(),
  ..   key = col_double(),
  ..   loudness = col_double(),
  ..   mode = col_double(),
  ..   speechiness = col_double(),
  ..   acousticness = col_double(),
  ..   instrumentalness = col_double(),
  ..   liveness = col_double(),
  ..   valence = col_double(),
  ..   tempo = col_double(),
  ..   genre = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

From the source https://developer.spotify.com/documentation/web-api/reference/get-audio-features, I was able to find out what the variables mean. I will list the explanation for each variable below.

artist: chr - Artist of the song

song: chr - The title of a song

duration_ms: num - The duration of the song in milliseconds

explicit: logi - If the song has explicit lyrics

year: num - The year the song was released

popularity: num - The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity.

danceability: num - Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

energy: num - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

-Range: 0 - 1 -Example: 0.729

key: num - The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

-Range: -1 - 11 -Example: 9

loudness: num - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

mode: num - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness: num - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

-Example: 0.0556

acousticness: num - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

-Range: 0 - 1 -Example: 0.00242

instrumentalness: num - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

-Example: 0.00686

liveness: num - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

-Example: 0.0866

valence: num - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

-Range: 0 - 1 -Example: 0.428

tempo: num - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

-Example: 118.211

genre: chr - The genre of the song

SO many interesting variables to work with and choose from! I don’t know where to start!

Data Cleaning

I am going to start cleaning the data and trim the data set down to the variables I want to work with.

First, I want to see if I have any NAs.

sum(is.na(spotifysongs))
[1] 0

Wow, I got lucky picking this data set. There are no NAs, haha!
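
If the total had not been zero, a per-column count would show where the gaps are. The snippet below is only a sketch on a made-up three-row data frame (`toy` and its values are hypothetical, not part of the Spotify data):

```r
# Hypothetical toy data with two missing values
toy <- data.frame(
  artist = c("A", "B", NA),
  popularity = c(77, NA, 65),
  tempo = c(95.1, 148.7, 136.9)
)

# Total number of NAs, as in the check above
total_na <- sum(is.na(toy))

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```

Running the same `colSums(is.na(...))` call on `spotifysongs` should return a zero for every column, since the total above is 0.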

With what I know now about the variables, I know that I want to work with the danceability variable! I love dancing, and I would love to know which songs are the most danceable and which artists are best known for making danceable music. I also want to work with all the chr variables and the popularity variable. Maybe the explicit variable too. Looking at the relationship between the tempo and the danceability of a song could be interesting as well!

I will now trim the data set to the variables I want to work with, and also add variables I could possibly correlate with.

d1 <- spotifysongs |>
  select(artist, song, duration_ms, explicit, year, danceability, popularity, energy, loudness, speechiness, valence, tempo, genre) |>
  mutate(valence_category = case_when(
    valence >= 0 & valence < 0.2 ~ "Very Low",
    valence >= 0.2 & valence < 0.4 ~ "Low",
    valence >= 0.4 & valence < 0.6 ~ "Medium",
    valence >= 0.6 & valence < 0.8 ~ "High",
    valence >= 0.8 & valence <= 1 ~ "Very High"
  ))
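
As a side note, the same five bins can also be built with base R’s `cut()`. This is only an alternative sketch, not the code used in this project, and the `valence` vector here is a handful of made-up values chosen to hit the bin edges:

```r
# Hypothetical valence values, including the bin edges
valence <- c(0, 0.19, 0.2, 0.55, 0.8, 1)

# right = FALSE makes the bins [0, 0.2), [0.2, 0.4), ... and
# include.lowest = TRUE closes the last bin so that 1 is included: [0.8, 1]
valence_category <- cut(
  valence,
  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
  labels = c("Very Low", "Low", "Medium", "High", "Very High"),
  right = FALSE,
  include.lowest = TRUE
)

as.character(valence_category)
```

One difference to keep in mind: `cut()` returns a factor whose levels run from "Very Low" to "Very High", so the levels would still need reordering if "Very High" should come first in a plot legend.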

Why did I do this, you may ask? I want to make a highcharter plot that shows the relationship between danceability and tempo, compared across the valence categories. I think this would be a great way to see whether higher valence goes with higher danceability. Usually happy music makes people want to dance, but I can think of happy songs that don’t really make people dance. Also, I can look for “sad” or “angry” songs (where the valence category is low) to see if some sad songs make people want to dance! (I love dancing, which is why I also want to see if any popular sad songs are good to dance to and listen to, haha!) It would be nice to see the comparison.

I also want to reorder the valence category to make it look better in the plot.

d1$valence_category <- factor(d1$valence_category, levels = c( "Very High", "High", "Medium", "Low", "Very Low"))

Now that I have a perfect data set, it’s time to find and make correlation models!

I’m going to make a linear regression model. With the data set I created above, I want to see if there is a correlation between the danceability and the tempo of a song, and whether tempo can predict danceability. I think this would be a great model to make, and I am excited to see the results! (My guess: the faster a song, the more danceable it is, and the slower a song, the less danceable it is.) Before using the lm function, I will first use the cor function to find the correlation.

cor(d1$danceability, d1$tempo)
[1] -0.173418

The number is -0.173418. This is a weak negative correlation (bummer, but good)! This is so interesting, because I expected that the lower the tempo, the lower the danceability. Apparently, this is not the case with these songs! Nice to know. Let me make a linear regression model for this and list my findings, then look for other variables that may have a correlation with danceability.

lm1 <- lm(danceability ~ tempo, data = d1)
summary(lm1)

Call:
lm(formula = danceability ~ tempo, data = d1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.57914 -0.08595  0.00890  0.09327  0.30548 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.7759056  0.0141238  54.936  < 2e-16 ***
tempo       -0.0009030  0.0001147  -7.871 5.72e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1383 on 1998 degrees of freedom
Multiple R-squared:  0.03007,   Adjusted R-squared:  0.02959 
F-statistic: 61.95 on 1 and 1998 DF,  p-value: 5.724e-15

The equation for this model is y = -0.0009030x + 0.7759056 (danceability = -0.0009030 × tempo + 0.7759056).

The equation says that for every 1 BPM increase in tempo, the predicted danceability of a song decreases by 0.0009030 units. The intercept is 0.7759056, which means that if the tempo of a song were 0, the predicted danceability would be 0.7759056. A tempo of 0 BPM is not a realistic song, so the intercept is best read as a baseline rather than a real prediction.
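
To get a feel for how small this slope is, the fitted equation can be plugged in by hand. The sketch below copies the two coefficients from the summary output; the helper `predict_danceability` and the three tempos are made up for illustration:

```r
# Coefficients copied from summary(lm1) above
intercept <- 0.7759056
slope <- -0.0009030

# Hypothetical helper: predicted danceability for a given tempo in BPM
predict_danceability <- function(tempo) {
  intercept + slope * tempo
}

# A slow, a medium, and a fast song
predicted <- predict_danceability(c(90, 120, 180))
round(predicted, 3)
```

Doubling the tempo from 90 to 180 BPM only moves the prediction from about 0.69 to about 0.61, which matches how weak the correlation is.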

We also need to look at the adjusted R-squared value, 0.02959. It states that about 3% of the variation in danceability may be explained by the model.

In other words, the model is not a good fit for the data. That’s okay though, because it means I can look for other variables that may have a correlation with danceability! (I will not give up, I will find a good correlation with danceability!)
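
The R-squared values in the summary connect directly to the correlation found earlier: in a regression with a single predictor, the multiple R-squared is the correlation squared. A quick check, using only numbers copied from the output above (n = 2000 songs, 1 predictor):

```r
# Correlation between danceability and tempo, from cor() above
r <- -0.173418

# With one predictor, R-squared is the squared correlation
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors (p)
n <- 2000
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 5)
```

Both values agree with the summary output (0.03007 and 0.02959) after rounding.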

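Before we do that, let’s look at the p-value. The p-value is 5.724e-15 (0.000000000000005724 in decimal form). Because it is far below 0.05, tempo is a statistically significant predictor of danceability, and the negative slope means danceability decreases slightly as tempo increases (somehow)! Significant does not mean strong, though: with 2,000 songs even a small effect shows up clearly, and tempo still explains only about 3% of the variation.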

I am going to compare multiple variables using a scatterplot matrix.

ggpairs(d1, columns = c(6, 7, 8, 9, 10, 11, 12))

Somehow, the only strong correlation is between the loudness and the energy of a song. That makes sense, since the two are related: when a song is louder, its energy and pace are usually higher. This is a good correlation to look at. HOWEVER, I really want to use the danceability variable.
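
The numbers behind a scatterplot matrix can also be read as a plain correlation matrix with `cor()`. The sketch below uses simulated data, not the Spotify data: `toy` is a hypothetical data frame in which loudness is built to depend on energy and tempo is unrelated to both.

```r
set.seed(1)
n <- 200

# Simulated (hypothetical) data: loudness rises with energy, tempo is independent
energy <- runif(n)
toy <- data.frame(
  energy = energy,
  loudness = -12 + 8 * energy + rnorm(n, sd = 1),
  tempo = runif(n, 60, 200)
)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(toy), 2)
cor_matrix
```

On the real data, something like `cor(d1[, 6:12])` should give the same correlations that `ggpairs()` prints in its upper panels.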

I also notice a weird pattern between the popularity and the danceability of a song. I want to see if the popularity of a song is related to its danceability. I will make a linear regression model for this and list my findings.

lm2 <- lm(danceability ~ popularity, data = d1)
summary(lm2)

Call:
lm(formula = danceability ~ popularity, data = d1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53981 -0.08649  0.00859  0.09679  0.30780 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.688e-01  9.358e-03  71.471   <2e-16 ***
popularity  -2.334e-05  1.472e-04  -0.158    0.874    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1405 on 1998 degrees of freedom
Multiple R-squared:  1.257e-05, Adjusted R-squared:  -0.0004879 
F-statistic: 0.02512 on 1 and 1998 DF,  p-value: 0.8741

The p-value is 0.874 and the R-squared is practically zero, so popularity does not explain the danceability of a song in this data set.

I am going to create a few simple plots to explore what I want to focus on for my final visualization.

My main goal is to use the valence category to see how it relates to a specific song. I especially want to see if the valence category can predict the danceability of a song. Let’s start with a box plot to see the relationship between the valence category and the danceability of a song.

d1 |>
  ggplot(aes(x = valence_category, y = danceability)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Valence Category vs Danceability",
       x = "Valence Category",
       y = "Danceability",
       caption = "Source: Spotify WebScrapping")

No way, the higher the valence category, the higher the danceability of a song. This is so interesting! There are also lots of songs with high valence (which means happier songs) that don’t have much danceability. As I imagined, the very low valence songs (sadder or angrier songs) still include songs with high danceability.

With a little more thinking, I realized I want to change the way I go about finding what influences danceability. What if I analyzed popularity and danceability instead? I will make a plot with x = popularity, y = danceability, and color = valence category. I want to see if the valence category can predict the danceability of a song, and see how danceability changes as popularity goes up.

d1 |>
  ggplot(aes(x = popularity, y = danceability, color = valence_category)) +
  geom_point() +
  labs(title = "Popularity vs Danceability",
       x = "Popularity",
       y = "Danceability",
       caption = "Source: Spotify WebScrapping")

What an interesting plot. It’s time to make the plotly graph! I will make a plot to show the relationship between the danceability and the popularity of songs, comparing the valence category: x = popularity, y = danceability, and color = valence category. First I will graph it with ggplot2 and convert it with plotly, and later I will also put the artist, song name, and genre in the tooltip. Once I find a good visualization, I will make my final visualization with custom colors and draw conclusions.

g1 <- d1 |>
  ggplot(aes(x = popularity, y = danceability, color = valence_category)) +
  geom_point() +
  labs(title = "Popularity vs Danceability",
       x = "Popularity",
       y = "Danceability",
       caption = "Source: Spotify WebScrapping")

g1 <- ggplotly(g1)  # pass the plot as the first argument (p); ggplotly() has no `plot` argument
g1 <- layout(g1, annotations = list(
  text = "Source: Spotify WebScrapping",
  x = 23, y = 0.25,  #Used first project for code from AI
  showarrow = FALSE
))

g1

This is an interesting data visualization, but it looks a little too cluttered. Instead of viewing all the songs, I will filter for songs that are labeled with only one genre. THAT WAY, we can possibly see how genres come into play with danceability, and how the popularity of a song within a specific genre relates to its danceability. I will do this by listing every genre label to find the single-genre ones.

#Show me all genre categories in the data set

unique(d1$genre)
 [1] "pop"                                  
 [2] "rock, pop"                            
 [3] "pop, country"                         
 [4] "rock, metal"                          
 [5] "hip hop, pop, R&B"                    
 [6] "hip hop"                              
 [7] "pop, rock"                            
 [8] "pop, R&B"                             
 [9] "Dance/Electronic"                     
[10] "pop, Dance/Electronic"                
[11] "rock, Folk/Acoustic, easy listening"  
[12] "metal"                                
[13] "hip hop, pop"                         
[14] "R&B"                                  
[15] "pop, latin"                           
[16] "Folk/Acoustic, rock"                  
[17] "pop, easy listening, Dance/Electronic"
[18] "rock"                                 
[19] "rock, blues, latin"                   
[20] "pop, rock, metal"                     
[21] "rock, pop, metal"                     
[22] "hip hop, R&B"                         
[23] "pop, Folk/Acoustic"                   
[24] "set()"                                
[25] "hip hop, pop, latin"                  
[26] "hip hop, Dance/Electronic"            
[27] "hip hop, pop, rock"                   
[28] "World/Traditional, Folk/Acoustic"     
[29] "Folk/Acoustic, pop"                   
[30] "rock, easy listening"                 
[31] "World/Traditional, hip hop"           
[32] "hip hop, pop, R&B, latin"             
[33] "rock, blues"                          
[34] "rock, R&B, Folk/Acoustic, pop"        
[35] "latin"                                
[36] "pop, R&B, Dance/Electronic"           
[37] "World/Traditional, rock"              
[38] "pop, rock, Dance/Electronic"          
[39] "pop, easy listening, jazz"            
[40] "rock, Dance/Electronic"               
[41] "World/Traditional, pop, Folk/Acoustic"
[42] "country"                              
[43] "hip hop, pop, Dance/Electronic"       
[44] "hip hop, pop, country"                
[45] "World/Traditional, rock, pop"         
[46] "World/Traditional, pop"               
[47] "hip hop, pop, R&B, Dance/Electronic"  
[48] "pop, R&B, easy listening"             
[49] "rock, pop, Dance/Electronic"          
[50] "Folk/Acoustic, rock, pop"             
[51] "rock, pop, metal, Dance/Electronic"   
[52] "pop, rock, Folk/Acoustic"             
[53] "country, latin"                       
[54] "rock, classical"                      
[55] "rock, Folk/Acoustic, pop"             
[56] "hip hop, rock, pop"                   
[57] "easy listening"                       
[58] "hip hop, latin, Dance/Electronic"     
[59] "hip hop, country"                     
d2 <- d1 |>
  filter(genre %in% c("pop", "Dance/Electronic", "latin", "easy listening", "hip hop", "metal", "R&B", "rock", "set()", "country"))
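
Typing out the single-genre labels by hand works for a list this short. As a sketch of another way to do it, labels with more than one genre always contain a comma, so the single-genre rows can also be found by pattern. The `genre` vector below is a small hypothetical sample of the labels listed above:

```r
# A few labels copied from the unique() output above
genre <- c("pop", "rock, pop", "hip hop", "pop, R&B", "set()", "Dance/Electronic")

# TRUE where the label has no comma, i.e. only one genre is listed
single_genre <- !grepl(",", genre, fixed = TRUE)

genre[single_genre]
```

Applied to the full data set, the equivalent filter would be `filter(!grepl(",", genre, fixed = TRUE))`, which should keep the same ten labels, including "set()".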

The filter above keeps only the songs labeled with a single genre. (The label "set()" is not a real genre; it appears to be a placeholder for songs with no genre listed, but I kept it as its own group.) This way, we can really compare genres at the same time for danceability with this graph, by counting how many songs in a specific genre have high or low danceability! We can also see the types of valence a genre has (sad or happy). Not only that, but we can see how popular some songs are too! I am going to do one more data analysis test to see if the genre of a song can predict the danceability of a song. I will make a linear regression model for this and list my findings. Since I am using a character variable, I will only look at the p-value.

lm3 <- lm(danceability ~ genre, data = d2)
summary(lm3)

Call:
lm(formula = danceability ~ genre, data = d2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51889 -0.07783  0.00811  0.09511  0.31911 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.57980    0.04244  13.662  < 2e-16 ***
genreDance/Electronic  0.10481    0.04733   2.214  0.02712 *  
genreeasy listening   -0.16280    0.14075  -1.157  0.24780    
genrehip hop           0.14403    0.04412   3.265  0.00115 ** 
genrelatin             0.15500    0.05479   2.829  0.00480 ** 
genremetal            -0.10636    0.06166  -1.725  0.08499 .  
genrepop               0.06809    0.04293   1.586  0.11319    
genreR&B               0.08366    0.05645   1.482  0.13875    
genrerock             -0.05359    0.04595  -1.166  0.24387    
genreset()             0.15920    0.05118   3.110  0.00194 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1342 on 711 degrees of freedom
Multiple R-squared:  0.1505,    Adjusted R-squared:  0.1398 
F-statistic:    14 on 9 and 711 DF,  p-value: < 2.2e-16

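The p-value is less than 2.2e-16, which means that the genre of a song is a meaningful variable to explain the danceability of a song. This is a good sign! Because genre is a character variable, each coefficient is the difference from the baseline genre, country, whose average danceability is the intercept (0.57980). The adjusted R-squared value is 0.1398, which means that 13.98% of the variation in danceability may be explained by the model. The data is ready to be visualized!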
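
Since each genre coefficient is an offset from the intercept, the model’s predicted danceability for a genre is just the intercept plus that genre’s coefficient. The sketch below copies the estimates from the summary output; country gets an offset of 0 because it is the baseline:

```r
# Intercept = predicted danceability for the baseline genre (country)
baseline <- 0.57980

# Genre offsets copied from summary(lm3)
offsets <- c(
  "country" = 0,
  "Dance/Electronic" = 0.10481,
  "easy listening" = -0.16280,
  "hip hop" = 0.14403,
  "latin" = 0.15500,
  "metal" = -0.10636,
  "pop" = 0.06809,
  "R&B" = 0.08366,
  "rock" = -0.05359,
  "set()" = 0.15920
)

# Predicted danceability per genre, highest first
predicted <- sort(baseline + offsets, decreasing = TRUE)
round(predicted, 3)
```

For example, hip hop comes out at 0.57980 + 0.14403 = 0.72383, while easy listening lands at the bottom with about 0.417.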

I will now make my FINAL plot to show the relationship between the danceability and the popularity of songs, comparing the valence category: x = popularity, y = danceability, and color = valence_category. I’m also going to make each point bigger the higher the song’s danceability is. I will also put the artist, song name, and genre in the tooltip. For the colors, I selected these specific shades because Spotify is a “green” business (from its logo and website). The lower the valence (sad or angry songs), the more “dull” a song feels, so those points are darker (blackish green). The higher-valence songs (happier, exciting songs) are a lighter green for expression!

g2 <- d2 |>
  ggplot(aes(x = popularity, y = danceability, color = valence_category, size = danceability, text = paste("Artist:", artist, "<br>Song:", song, "<br>Genre:", genre))) +
  geom_point(alpha = 0.7, shape = 21, stroke = 1.5) +
  labs(title = "721 Songs with Diff Genres, Valence compares Popularity vs Danceability",
       x = "Popularity",
       y = "Danceability",
       caption = "Source: Spotify WebScrapping") +
  theme(legend.position = "bottom") +
  scale_color_manual(name = "Valence Categories | size of", values = c("Very Low" = "#012d21", "Low" = "#1e720f", "Medium" = "#1ab188", "High" = "#1ed800", "Very High" = "#00ff06")) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) +
  scale_size_continuous(range = c(1, 5)) +
  theme(text = element_text(family = "Montserrat")) #Used ChatAI to find some code for this


g2 <- ggplotly(g2)  # pass the plot as the first argument (p); ggplotly() has no `plot` argument
g2 <- layout(g2, annotations = list(
  text = "Source: Spotify WebScrapping",
  x = 20, y = 0.25,  #Used first project for code from AI
  showarrow = FALSE
))

g2 <- layout(g2, annotations = list(
  text = "Size: based on Danceability of a song",
  x = 22, y = 0.2,
  showarrow = FALSE
))

g2

Conclusion

a. The topic of the data, any variables included, what kind of variables they are, where the data came from and how you cleaned it up (be detailed and specific, using proper terminology where appropriate). Be sure to explain why you chose this topic and dataset – what meaning does it have for you? This part of the essay must go at the beginning, before you load data and libraries.

The main reason I chose this topic is that people all around the world, including myself, use Spotify to listen to millions of songs across so many genres and artists. With this data set, I got to see variables describing the songs, artists, and genres that are popular on Spotify. Each variable describes its own aspect of a song, such as its danceability, energy, tempo, and valence. After looking up what each variable means, I knew I wanted to work with danceability and valence. I also wanted to see how the danceability of a song relates to its popularity, its genre, and most importantly its valence group. I cleaned the data by selecting the variables I wanted to work with, and I created a new variable called valence_category to sort the valence of a song into five groups. I also filtered the data to include only songs that belong to exactly one genre. That way, it is easier to see how genre comes into play with danceability. I love music and dance, which is why I was so engaged with the danceability variable, and I like how valence, the mood of a song, may change the way people dance to it.

b. Incorporate brief background research about this topic. This background information will include information you find in an article, website, or book. Please source this background information within the essay or if you have multiple sources, include a bibliography. I am not particular about the format of this bibliography. If you need help finding articles, I am happy to help you and/or show you how to search the MC Library Database.

According to the source https://developer.spotify.com/documentation/web-api/reference/get-audio-features, Spotify offers a “Web API” that lets creators interact with Spotify’s streaming service. How? By reading content metadata, getting recommendations, creating and managing playlists, or controlling playback. It exists so developers can look into music and find context for all the music out there. All the variables and their meanings are documented on that website. From above, you can see what each variable means and what each number and character represents. For example, the danceability of a song is based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. It has a value from 0 to 1, with 1 being the most danceable. Another example is the valence of a song, a value between 0 and 1, with 1 being the most positive. Valence is based on the musical positiveness conveyed by a track: tracks with high valence sound more positive, while tracks with low valence sound more negative. This data set was put together through web scraping, and I was able to obtain it from Professor Saidi through a Google Drive set up for this class! I trimmed the data set to the variables I wanted to work with, and I made a few regression models to see if there was a relationship between the danceability of a song and its tempo, its popularity, and its genre. The data and the web scraping behind it made this analysis possible.

c. What the visualization represents, any interesting patterns or surprises that arise within the visualization, and anything that could have been shown that you could not get to work or that you wished you could have included.

A huge notable pattern I saw was how the Very High and High categories compare to the Very Low category: the higher the valence, the more likely a song is to be danceable. This is a good pattern to see, since it shows that the happier the song, the more likely it is to be danceable. However, I was surprised to see that some songs with high valence are not danceable, and some songs with low valence are. For example, MIA, a song from Bad Bunny, is a very low valence song (sad or angry), but it’s so popular and danceable. Isn’t that interesting? HOWEVER, I also realize that the MAJORITY of the songs are not danceable, but we can now find out which low valence songs are from this graph (I say look for songs over 5.5). There are also happy songs that are not danceable, such as American Idiot in the High valence category. Also, the very high valence song “Cruise” by Florida Georgia Line is not as danceable. However, like I said, most of the songs that are danceable are in the High and Very High valence categories. I’m happy that we can find songs that don’t match that standard though! The more you interact with the plot and look for specific songs, the more variety you can see. If I added more genres to this data, we could view even more songs that are danceable and popular within a specific genre!

One thing I want to point out is that I was also surprised to see that the linear regression model for the danceability and the tempo of a song was not a good fit for the data. I was expecting a positive correlation between the two variables, but there was a weak negative correlation instead. This was a good surprise, since it showed that how fast or slow a song is doesn’t really affect how danceable it is, at least not as much as everyone might think. One last thing: of ALL the genres, pop and hip hop have more high-valence songs with high danceability than any other genre.

The popularity doesn’t really have an effect on the valence. Everyone listens to all types of sad, angry, and happy music, so I wouldn’t expect popularity to change that. The popularity DID let us see which popular songs were high and low in danceability. Among the least popular songs, the valence groups showed no difference in how popular a song was.

Something that I wish I could have gotten to work is OUTLINE MARKERS for the points from geom_point! If I only knew how to put an outline around the circles on the graph, I feel like my plot would look a lot better. Also, if I knew how to get rid of some variables in the tooltip that weren’t essential, such as the valence category and danceability appearing TWICE, it would look so much better! It bums me out that I couldn’t find the right code to remove them, but it’s okay. I’m really in love with the way my final visualization turned out either way.
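
For anyone who wants to try the two wishes above, here is one possible approach, sketched on a tiny made-up data frame (`toy`, its songs, and its values are hypothetical). It assumes two things: that mapping valence to `fill` with `shape = 21` and a fixed outline `color` gives outlined circles, and that `ggplotly()`’s `tooltip = "text"` option limits the tooltip to the custom text. I have not tested this on the final plot, so treat it as a starting point rather than a fix.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data standing in for d2
toy <- data.frame(
  popularity = c(40, 55, 70, 85),
  danceability = c(0.45, 0.62, 0.71, 0.88),
  valence_category = factor(
    c("Low", "Medium", "High", "Very High"),
    levels = c("Low", "Medium", "High", "Very High")
  ),
  song = c("Song A", "Song B", "Song C", "Song D")
)

# shape = 21 is a circle with a separate outline (color) and inside (fill),
# so valence goes on fill and the outline is a fixed black
p <- ggplot(toy, aes(x = popularity, y = danceability,
                     fill = valence_category,
                     text = paste("Song:", song))) +
  geom_point(shape = 21, color = "black", size = 4, stroke = 0.8)

# tooltip = "text" shows only the custom text aesthetic in the hover box
p_interactive <- ggplotly(p, tooltip = "text")
p_interactive
```

With `fill` carrying the valence colors, the palette would go in `scale_fill_manual()` instead of `scale_color_manual()`.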