The dataset I chose compiles information on the top 2000 songs from the years 2000 to 2019. It was scraped with the Spotipy library for Python, which interacts directly with the Spotify API. The variables I chose to focus on were artist, song, energy, loudness, and explicit. They are defined as follows:
artist: name of the artist
song: name of the track
energy: a value from 0.0 to 1.0 that represents a perceptual measure of intensity and activity
loudness: the overall volume of a track in decibels (dB)
explicit: whether the lyrics or content of a song or music video meet one or more criteria that could be considered offensive or unsuitable for children
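Before plotting, a quick check of the five variables above can confirm that none of them contain missing values and that energy stays inside its documented 0.0 to 1.0 range. The sketch below is only an illustration: `songs_toy` and its values are made up, and the real data frame would be substituted in.

```r
# Toy stand-in for the real data frame; every value here is made up for illustration.
songs_toy <- data.frame(
  artist   = c("Artist A", "Artist B", "Artist C"),
  song     = c("Song 1", "Song 2", "Song 3"),
  energy   = c(0.83, 0.61, 0.72),
  loudness = c(-5.4, -7.1, -4.9),
  explicit = c(FALSE, TRUE, FALSE)
)

# The five variables this report focuses on.
focus_cols <- c("artist", "song", "energy", "loudness", "explicit")

# Count the missing values in each column of interest.
na_counts <- colSums(is.na(songs_toy[, focus_cols]))
print(na_counts)

# Energy is defined on a 0.0 to 1.0 scale, so anything outside it would be suspect.
energy_in_range <- all(songs_toy$energy >= 0 & songs_toy$energy <= 1)
print(energy_in_range)
```

If every count is zero and the range check passes, no rows need to be dropped for these columns.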
library(tidyverse) # also attaches dplyr and ggplot2
library(treemap)
data <- read.csv("C:/Users/rafiz/OneDrive/Desktop/data110/proj/songs_normalize.csv")
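# Colors for explicit status. read.csv() may parse True/False as logical values,
# which ggplot2 labels TRUE/FALSE, so both spellings are named here.
explicit_colors <- c("True" = "#84bd00", "False" = "#e57878", "TRUE" = "#84bd00", "FALSE" = "#e57878")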
ggplot(data, aes(x = energy, y = loudness, color = explicit)) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = TRUE, color = "#000080") + # Adding linear regression line with confidence interval
labs(x = "Energy", y = "Loudness", caption = "Source: Spotipy Library through Spotify API") +
ggtitle("Scatter Plot of Song Energy vs. Loudness with Linear Regression and Explicit Status") +
theme_minimal() +
scale_color_manual(values = explicit_colors) +
theme(
plot.background = element_rect(fill = "black"), #change bg to black
panel.background = element_rect(fill = "black"),
panel.grid.major = element_blank(), #getting rid of grid lines
panel.grid.minor = element_blank(),
axis.text.x = element_text(color = "white"), # x-axis
axis.text.y = element_text(color = "white"), # y-axis
text = element_text(color = "white"), #change text to white
plot.title = element_text(color = "white") #title to white
)
## `geom_smooth()` using formula = 'y ~ x'
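The line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit of y on x, so its slope and fit quality can also be read off numerically with `lm()`. The sketch below uses a simulated data frame (`toy`, with a made-up relationship between energy and loudness) and is not the result for the real data.

```r
# Simulated stand-in data; the relationship below is invented for illustration only.
set.seed(1)
toy <- data.frame(energy = runif(50))
toy$loudness <- -12 + 8 * toy$energy + rnorm(50, sd = 1)

# Fit the same kind of model geom_smooth(method = "lm") uses: loudness ~ energy.
fit <- lm(loudness ~ energy, data = toy)

# Intercept and slope of the regression line.
print(coef(fit))

# Slope: estimated change in loudness (dB) per unit increase in energy.
slope <- unname(coef(fit)["energy"])

# Share of the variance in loudness explained by energy.
r_squared <- summary(fit)$r.squared
print(r_squared)
```

Running `lm(loudness ~ energy, data = data)` on the real data frame would give the numbers behind the plotted line.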
# count songs per artist and sort in descending order
artist_counts <- data |>
group_by(artist) |>
summarise(num_songs = n()) |>
ungroup() |>
arrange(desc(num_songs))
# top 25 artists
top_25_artists <- head(artist_counts, 25)
treemap(
top_25_artists,
index = "artist",
vSize = "num_songs",
title = "Treemap of Top 25 Artists by Number of Songs",
palette = "Purples"
)
Regarding cleaning: there were no missing values in the variables I chose to focus on, so the data needed very little cleaning. My treemap takes the artists with the most entries in the data frame and displays the top 25. Judging by the size of the rectangles, Rihanna had the most songs in the data frame, while Avicii had the fewest (among the top 25, not the entire data set). I initially tried a treemap of every artist; however, it was illegible and therefore entirely useless. I also attempted to display each artist's song count to give the viewer a better frame of reference; however, adding that code kept causing errors in other parts of my treemap, so I was unsuccessful. Finally, I tried to add a caption citing the source of the data; however, using ggplot’s ggtitle kept giving me errors, most likely because treemap() does not produce a ggplot object, so ggplot2 layers cannot be added to it.
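One possible way to show each artist's song count inside the rectangles is to build a label column that combines the name and the count, then pass that column as the treemap index. This is a sketch under assumptions, not the approach used above: `top_toy`, its values, and the `label` column are hypothetical, and the plot is only drawn if the treemap package is installed.

```r
# Toy stand-in for top_25_artists; the names and counts are made up.
top_toy <- data.frame(
  artist    = c("Artist A", "Artist B", "Artist C"),
  num_songs = c(12, 9, 7)
)

# Combine artist and count into one label; "\n" puts the count on its own line.
top_toy$label <- paste0(top_toy$artist, "\n", top_toy$num_songs, " songs")

# Draw the treemap using the combined label as the index (only if treemap is available).
if (requireNamespace("treemap", quietly = TRUE)) {
  treemap::treemap(
    top_toy,
    index = "label",
    vSize = "num_songs",
    title = "Toy treemap with song counts in the labels",
    palette = "Purples"
  )
}
```

Because the count lives in the label itself, no extra text layer is needed. A source note could be handled in a similar spirit by appending it to the `title` string.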