Data 110 Project 1 (Final)

Author

Chisom Anyanwu

Introduction

For this exploration I am using Spotify data on popular songs released from 1998 to 2020. The songs come from a variety of genres and artists. The variables in this data set are as follows: artist name, song release year, genre, tempo, danceability, popularity, valence, whether a song is explicit or not, energy, acousticness, and key. Danceability refers to how “suitable a track is for dancing based on a combination of elements including tempo, rhythm, beat strength, and more” (Spotify Song Stats). Valence is a measure of the musical positiveness of a song: songs with high valence sound happier and more energetic, while songs with low valence tend to sound sad or angry. For my visualization I plan to explore songs by female pop singers (Beyoncé, Katy Perry, Ariana Grande, and Britney Spears) and analyze their popularity with respect to other variables. I hope to find out how song popularity in the pop genre varies with factors like track duration, danceability, and valence.

First load libraries and data set

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("/Users/blossomanyanwu/Documents/Data 110 (Fall 2023)")
spotify<- read_csv("spotifysongs.csv")

Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning: Next I had to clean my data. I had already selected the 4 pop artists I would be focusing on, so I filtered the data set down to their songs and added a column for track duration in minutes.

female_pop <- spotify %>% 
  filter(artist %in% c('Beyoncé', 'Britney Spears', 'Ariana Grande', 'Katy Perry'))
female_pop <- female_pop %>%
  mutate(duration_min = duration_ms / 60000)
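
Because `%in%` matches names exactly, a misspelling such as "Brittney Spears" or a missing accent in "Beyoncé" would silently drop that artist from the filtered data. Below is a minimal sketch of a sanity check in base R. The `spotify_demo` data frame and the `wanted` vector are hypothetical stand-ins, not rows from the real file.

```r
# Stand-in for the real `spotify` data; only the artist column matters here.
spotify_demo <- data.frame(
  artist = c("Beyoncé", "Britney Spears", "Ariana Grande", "Katy Perry", "Calvin Harris"),
  stringsAsFactors = FALSE
)

# Names we intend to keep; "Brittney" is misspelled on purpose.
wanted <- c("Beyoncé", "Brittney Spears", "Ariana Grande", "Katy Perry")

# Names that were requested but never appear in the data.
missing_artists <- setdiff(wanted, unique(spotify_demo$artist))
missing_artists
```

If `missing_artists` is not empty, the filter is keeping fewer artists than intended and the spelling should be checked against the data.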

Creating a Visualization

artist_colors <- c("Beyoncé" = "hotpink",
                   "Ariana Grande" = "green",
                   "Britney Spears" = "orange", 
                   "Katy Perry" = "blue")
pop_plot<-ggplot(female_pop, aes(x = duration_min, y = popularity, color = artist, size= danceability)) +
  geom_point(alpha = 1.0, position = "jitter") +
  labs(title = "Female Pop Profile: Duration vs. Popularity",
       x = "Duration (minutes)",
       y = "Popularity",
       size= "Danceability",
       color = "Artist") + 
  scale_color_manual(values = artist_colors) +
  theme_minimal()
pop_plot

This visualization is very cluttered, and plotly was not working. To fix this, I decided to omit Britney Spears's songs from my final visualization.

modern_pop <- spotify %>% 
  filter(artist %in% c('Beyoncé', 'Ariana Grande', 'Katy Perry'))
artist_colors <- c("Beyoncé" = "hotpink",
                   "Ariana Grande" = "green",
                   "Katy Perry" = "blue")

modern_pop <- modern_pop %>%
  mutate(duration_min = duration_ms / 60000)

view(modern_pop)

ggplot(modern_pop, aes(x = duration_min, y = popularity, color = artist, size= danceability)) +
  geom_point(alpha = 0.7, position = "jitter") +
  labs(title = "Female Pop Profile: Track Duration vs. Popularity",
       x = "Duration (minutes)",
       y = "Popularity",
       size= "Danceability",
       color = "Artist",
       caption= "Source: Spotify") + 
  scale_color_manual(values = artist_colors) +
  theme_minimal()

Using the “spotifysongs.csv” dataset, I decided to explore the popularity of female pop artists based on variables like danceability and valence. The first thing I did was load the necessary libraries (tidyverse and plotly) and the dataset. The initial dataset contained songs from various artists spanning multiple genres. However, I wanted to focus on 4 female pop artists, so I had to clean the data. Using the filter command, I refined the data set to include these artists: Beyoncé, Katy Perry, Ariana Grande, and Britney Spears. I also noticed that song duration was recorded in milliseconds, and the numbers were very large (e.g., 253938), which would make the graph difficult to read. So I used the mutate command to create a new column that represents song duration in minutes.
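
As a quick worked example of the conversion, the sketch below applies the same division by 60000 to the example value quoted above. The "m:ss" label is an extra illustration and is not part of the original analysis.

```r
duration_ms <- 253938                  # example value quoted in the text

# Same arithmetic as the mutate() step: 60000 milliseconds per minute.
duration_min <- duration_ms / 60000
round(duration_min, 2)

# Optional: whole minutes plus leftover seconds (truncated), as an "m:ss" label.
minutes <- duration_ms %/% 60000
seconds <- floor((duration_ms %% 60000) / 1000)
label <- sprintf("%d:%02d", minutes, seconds)
label
```

A value of roughly 4.23 minutes is far easier to read on an axis than 253938 milliseconds.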

Once my data was graphed, I noticed a relationship between song duration and popularity. Most of the songs that ranked higher in popularity tended to be shorter. They also tended to be fairly high in danceability, although not all of the popular songs were. I used the jitter position to try to reduce some of the overlap I had seen.
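
One way to back up a visual impression like this is a rank correlation. The sketch below is only an illustration: the `demo` values are placeholders made up to exercise the code, not numbers from the Spotify file, so the result says nothing about the real songs.

```r
# Placeholder values only, made up for illustration; not from spotifysongs.csv.
demo <- data.frame(
  duration_min = c(3.1, 3.4, 3.8, 4.2, 4.6),
  popularity   = c(80, 78, 70, 66, 60)
)

# Spearman rank correlation: a negative value means longer tracks
# tend to have lower popularity ranks.
rho <- cor(demo$duration_min, demo$popularity, method = "spearman")
rho
```

On the real data, replacing `demo` with `female_pop` or `modern_pop` would give a number to report alongside the plot.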

I think the biggest challenge I faced was the overlapping plot points. I used jitter and tried plotly to reduce that issue, but the points still overlapped a little. Another issue was that the plotly plot I made would not render, so I had to delete it. After that, I removed Britney Spears from the data and decided to focus on modern 2000s pop music. I also made the points more transparent so that overlapping points are still visible. I could only show so many variables, and I think that including things like energy and valence could have made the visualization more interesting. I changed the colors multiple times to get enough contrast to keep the visualization readable. If I were to explore this further, I would add release dates (like I did for the Calvin Harris data that I explore below) and look at pop trends over time as different pop artists make their debuts.
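
Another option for the overlap problem, not used in this project, is to give each artist their own panel with `facet_wrap()`. This is a minimal sketch under that assumption. The `demo` rows are placeholders made up for illustration, and the real plot would use `modern_pop` instead.

```r
library(ggplot2)

# Placeholder rows, made up for illustration; swap in `modern_pop` for the real plot.
demo <- data.frame(
  artist       = rep(c("Beyoncé", "Ariana Grande", "Katy Perry"), each = 3),
  duration_min = c(3.2, 3.9, 4.5, 2.9, 3.3, 3.8, 3.4, 3.7, 4.1),
  popularity   = c(70, 65, 60, 82, 78, 75, 74, 72, 68),
  danceability = c(0.60, 0.70, 0.50, 0.80, 0.70, 0.60, 0.70, 0.60, 0.65)
)

faceted <- ggplot(demo, aes(x = duration_min, y = popularity, size = danceability)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ artist) +   # one panel per artist, so points from different artists never overlap
  labs(x = "Duration (minutes)", y = "Popularity", size = "Danceability") +
  theme_minimal()
faceted
```

With one panel per artist, color is freed up and could be mapped to another variable such as energy or valence.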

Further Explorations

calvin_harris <- spotify %>% 
  filter(artist %in% c('Calvin Harris'))
calvin_harris <- calvin_harris %>%
  mutate(duration_min = duration_ms / 60000)

After cleaning the data, I wanted to organize the songs by release date.

calvin_harris[calvin_harris$song == "Acceptable in the 80's", "release_date"] <- as.Date("2007-12-30")
calvin_harris[calvin_harris$song == "I'm Not Alone - Radio Edit", "release_date"] <- as.Date("2009-06-04")
calvin_harris[calvin_harris$song == "Bounce (feat. Kelis) - Radio Edit", "release_date"] <- as.Date("2009-06-04")
calvin_harris[calvin_harris$song == "Feel So Close - Radio Edit", "release_date"] <- as.Date("2011-08-19")
calvin_harris[calvin_harris$song == "Let's Go (feat. Ne-Yo)", "release_date"] <- as.Date("2012-03-30")
calvin_harris[calvin_harris$song == "We'll Be Coming Back (feat. Example)", "release_date"] <- as.Date("2012-06-02")
calvin_harris[calvin_harris$song == "Drinking from the Bottle (feat. Tinie Tempah)", "release_date"] <- as.Date("2013-01-27")
calvin_harris[calvin_harris$song == "I Need Your Love (feat. Ellie Goulding)", "release_date"] <- as.Date("2013-04-02")
calvin_harris[calvin_harris$song == "Sweet Nothing (feat. Florence Welch)", "release_date"] <- as.Date("2012-10-12")
calvin_harris[calvin_harris$song == "Under Control (feat. Hurts)", "release_date"] <- as.Date("2013-10-07")
calvin_harris[calvin_harris$song == "Giant (with Rag'n'Bone Man)", "release_date"] <- as.Date("2019-01-11")
calvin_harris[calvin_harris$song == "One Kiss (with Dua Lipa)", "release_date"] <- as.Date("2018-04-06")
calvin_harris[calvin_harris$song == "Feels (feat. Pharrell Williams, Katy Perry & Big Sean)", "release_date"] <- as.Date("2017-06-15")
calvin_harris[calvin_harris$song == "My Way", "release_date"] <- as.Date("2016-09-16")
calvin_harris[calvin_harris$song == "This Is What You Came For (feat. Rihanna)", "release_date"] <- as.Date("2016-04-29")
calvin_harris[calvin_harris$song == "How Deep Is Your Love", "release_date"] <- as.Date("2015-07-15")
calvin_harris[calvin_harris$song == "Outside (feat. Ellie Goulding)", "release_date"] <- as.Date("2014-10-20")
calvin_harris[calvin_harris$song == "Summer", "release_date"] <- as.Date("2014-03-14")
calvin_harris[calvin_harris$song == "Blame (feat. John Newman)", "release_date"] <- as.Date("2014-09-05")
calvin_harris[calvin_harris$song == "Thinking About You (feat. Ayah Marar)", "release_date"] <- as.Date("2013-08-02")

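The one-assignment-per-song approach works, but the same result can be sketched with a small lookup table and a join, which keeps all the dates in one place. In this sketch `calvin_demo` and `release_dates` are hypothetical names, only three of the songs are included, and the dates are copied from the assignments above.

```r
library(dplyr)

# Stand-in with only the song column; the real `calvin_harris` has more columns.
calvin_demo <- data.frame(
  song = c("Summer", "My Way", "How Deep Is Your Love"),
  stringsAsFactors = FALSE
)

# Lookup table: one row per song, with the dates used in the assignments above.
release_dates <- data.frame(
  song = c("Summer", "My Way", "How Deep Is Your Love"),
  release_date = as.Date(c("2014-03-14", "2016-09-16", "2015-07-15")),
  stringsAsFactors = FALSE
)

# Attach the dates by song title, then order the songs chronologically.
calvin_demo <- calvin_demo %>%
  left_join(release_dates, by = "song") %>%
  arrange(release_date)
calvin_demo
```

Any song missing from the lookup table would get `NA` as its release date, which makes gaps easy to spot.
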
Now that I had cleaned my data, I moved on to creating my visualization. I wanted to create a bubble plot so I could display multiple variables. However, I had to be mindful not to clutter my graph, which would make it difficult to read.

p1<-ggplot(calvin_harris, aes(x = release_date, y = popularity, size = danceability, color = energy)) +
  geom_point(alpha = .8, shape = 19, position = "jitter") + 
  scale_size_continuous(range = c(2, 10)) +  
  scale_color_gradient(low = "red", high = "blue") + 
  labs(title = "Calvin Harris: Popularity over Time with Danceability & Energy",
       x = "Release Date",
       y = "Popularity",
       size = "Danceability",
       color = "Energy") +
  theme_minimal() +
  theme(legend.position="right") +
  guides(shape = "none")

p1