Data 110 Project 1 (Final)

Author

Chisom Anyanwu

Introduction

First, load the libraries and the data set.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
setwd("/Users/blossomanyanwu/Documents/Data 110 (Fall 2023)")
spotify <- read_csv("spotifysongs.csv")
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
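
The message above notes that the column specification can be declared up front. As a minimal sketch, assuming readr 2.x and using a small inline CSV with hypothetical rows in place of spotifysongs.csv, it might look like this:

```r
library(readr)

# Hypothetical two-row stand-in for spotifysongs.csv.
# I() tells readr to treat the string as literal data, not a file path.
csv_text <- I("artist,song,duration_ms,explicit\nArtist A,Song 1,200000,FALSE\nArtist B,Song 2,180000,TRUE\n")

# Declaring the column types documents the expected schema
# and suppresses the "Column specification" message.
demo <- read_csv(
  csv_text,
  col_types = cols(
    artist = col_character(),
    song = col_character(),
    duration_ms = col_double(),
    explicit = col_logical()
  )
)

demo
```

For the real file, the same `col_types` argument would need an entry for each of the 18 columns, or `show_col_types = FALSE` could be passed instead to keep readr's guesses and only hide the message.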

Data Cleaning: Next, I cleaned the data. I had already chosen the four pop artists to focus on, so I filtered the data set down to their songs and added a column for track duration in minutes.

female_pop <- spotify %>% 
  filter(artist %in% c('Beyoncé', 'Britney Spears', 'Ariana Grande', 'Katy Perry'))
female_pop <- female_pop %>%
  mutate(duration_min = duration_ms / 60000)
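
Because `%in%` matches names exactly, a spelling or accent difference silently drops an artist. The sketch below is one way to check which requested names matched nothing. It uses hypothetical rows (not the real data set), with "Beyonce" deliberately written without the accent:

```r
library(dplyr)

# Hypothetical rows; "Beyonce" is deliberately missing its accent.
songs <- data.frame(
  artist = c("Beyonce", "Katy Perry", "Ariana Grande", "Drake"),
  song = c("Song A", "Song B", "Song C", "Song D")
)

wanted <- c("Beyoncé", "Britney Spears", "Ariana Grande", "Katy Perry")

kept <- songs %>% filter(artist %in% wanted)

# Rows kept per artist
count(kept, artist)

# Requested names that matched no rows at all
missing_artists <- setdiff(wanted, unique(kept$artist))
missing_artists
```

An empty `missing_artists` would suggest that every requested artist is present under exactly that spelling.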

Creating a Visualization

artist_colors <- c("Beyoncé" = "hotpink",
                   "Ariana Grande" = "green",
                   "Britney Spears" = "orange", 
                   "Katy Perry" = "blue")
pop_plot<-ggplot(female_pop, aes(x = duration_min, y = popularity, color = artist, size= danceability)) +
  geom_point(alpha = 1.0, position = "jitter") +
  labs(title = "Female Pop Profile: Duration vs. Popularity",
       x = "Duration (minutes)",
       y = "Popularity",
       size= "Danceability",
       color = "Artist") + 
  scale_color_manual(values = artist_colors) +
  theme_minimal()
pop_plot

This visualization is very cluttered, and plotly was not working. To fix this, I decided to omit Britney Spears's data from my final visualization.
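
The report does not say why plotly failed, so the following is only a sketch of the usual route from a ggplot to an interactive chart, `ggplotly()`. It uses hypothetical rows and a made-up `demo` data frame rather than `female_pop`:

```r
library(ggplot2)
library(plotly)

# Hypothetical rows standing in for female_pop.
demo <- data.frame(
  artist = c("Artist A", "Artist A", "Artist B", "Artist B"),
  song = c("Song 1", "Song 2", "Song 3", "Song 4"),
  duration_min = c(3.2, 4.1, 3.6, 3.9),
  popularity = c(70, 55, 80, 62),
  danceability = c(0.60, 0.72, 0.55, 0.81)
)

# The `text` aesthetic is not drawn by ggplot2;
# ggplotly() picks it up for the hover label.
p <- ggplot(demo, aes(x = duration_min, y = popularity,
                      color = artist, size = danceability, text = song)) +
  geom_point(alpha = 0.7) +
  theme_minimal()

# Convert the static plot into an interactive htmlwidget.
fig <- ggplotly(p, tooltip = "text")

# Printing `fig` displays the widget in a notebook or the RStudio viewer.
```

Hover labels can ease clutter because song titles no longer need to be printed on the chart itself.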

modern_pop <- spotify %>% 
  filter(artist %in% c('Beyoncé', 'Ariana Grande', 'Katy Perry'))
artist_colors <- c("Beyoncé" = "hotpink",
                   "Ariana Grande" = "green",
                   "Katy Perry" = "blue")
modern_pop <- modern_pop %>%
  mutate(duration_min = duration_ms / 60000)
view(modern_pop)
ggplot(modern_pop, aes(x = duration_min, y = popularity, color = artist, size= danceability)) +
  geom_point(alpha = 0.7, position = "jitter") +
  labs(title = "Female Pop Profile: Track Duration vs. Popularity",
       x = "Duration (minutes)",
       y = "Popularity",
       size= "Danceability",
       color = "Artist",
       caption= "Source: Spotify") + 
  scale_color_manual(values = artist_colors) +
  theme_minimal()

Using the "spotifysongs.csv" data set, I decided to explore the popularity of female pop artists based on variables like track duration and danceability. The first thing I did was load the necessary libraries (tidyverse and plotly) and the data set. The initial data set contained songs from various artists spanning multiple genres. However, I wanted to focus on four female pop artists, so I had to clean the data. Using the filter command, I narrowed the data set to these artists: Beyoncé, Katy Perry, Ariana Grande, and Britney Spears. I also noticed that song duration was recorded in milliseconds, and the numbers were very large (e.g., 253938), which would make the graph difficult to read. So I used the mutate command to create a new column that represents song duration in minutes.
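
As a worked example of the conversion described above, here is the arithmetic for the value quoted in the text. The minutes:seconds label is an extra illustration, not part of the original analysis:

```r
# Example duration quoted in the text, in milliseconds.
duration_ms <- 253938

# There are 60,000 milliseconds in a minute.
duration_min <- duration_ms / 60000
round(duration_min, 2)   # 4.23

# Optional: a minutes:seconds label for display.
total_sec <- duration_ms %/% 1000          # whole seconds
label <- sprintf("%d:%02d", total_sec %/% 60, total_sec %% 60)
label                    # "4:13"
```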

Further Explorations

calvin_harris <- spotify %>% 
  filter(artist %in% c('Calvin Harris'))
calvin_harris <- calvin_harris %>%
  mutate(duration_min = duration_ms / 60000)

After cleaning the data, I wanted to organize the songs by release date, so I assigned a release date to each song.

calvin_harris[calvin_harris$song == "Acceptable in the 80's", "release_date"] <- as.Date("2007-12-30")
calvin_harris[calvin_harris$song == "I'm Not Alone - Radio Edit", "release_date"] <- as.Date("2009-06-04")
calvin_harris[calvin_harris$song == "Bounce (feat. Kelis) - Radio Edit", "release_date"] <- as.Date("2009-06-04")
calvin_harris[calvin_harris$song == "Feel So Close - Radio Edit", "release_date"] <- as.Date("2011-08-19")
calvin_harris[calvin_harris$song == "Let's Go (feat. Ne-Yo)", "release_date"] <- as.Date("2012-03-30")
calvin_harris[calvin_harris$song == "We'll Be Coming Back (feat. Example)", "release_date"] <- as.Date("2012-06-02")
calvin_harris[calvin_harris$song == "Drinking from the Bottle (feat. Tinie Tempah)", "release_date"] <- as.Date("2013-01-27")
calvin_harris[calvin_harris$song == "I Need Your Love (feat. Ellie Goulding)", "release_date"] <- as.Date("2013-04-02")
calvin_harris[calvin_harris$song == "Sweet Nothing (feat. Florence Welch)", "release_date"] <- as.Date("2012-10-12")
calvin_harris[calvin_harris$song == "Under Control (feat. Hurts)", "release_date"] <- as.Date("2013-10-07")
calvin_harris[calvin_harris$song == "Giant (with Rag'n'Bone Man)", "release_date"] <- as.Date("2019-01-11")
calvin_harris[calvin_harris$song == "One Kiss (with Dua Lipa)", "release_date"] <- as.Date("2018-04-06")
calvin_harris[calvin_harris$song == "Feels (feat. Pharrell Williams, Katy Perry & Big Sean)", "release_date"] <- as.Date("2017-06-15")
calvin_harris[calvin_harris$song == "My Way", "release_date"] <- as.Date("2016-09-16")
calvin_harris[calvin_harris$song == "This Is What You Came For (feat. Rihanna)", "release_date"] <- as.Date("2016-04-29")
calvin_harris[calvin_harris$song == "How Deep Is Your Love", "release_date"] <- as.Date("2015-07-15")
calvin_harris[calvin_harris$song == "Outside (feat. Ellie Goulding)", "release_date"] <- as.Date("2014-10-20")
calvin_harris[calvin_harris$song == "Summer", "release_date"] <- as.Date("2014-03-14")
calvin_harris[calvin_harris$song == "Blame (feat. John Newman)", "release_date"] <- as.Date("2014-09-05")
calvin_harris[calvin_harris$song == "Thinking About You (feat. Ayah Marar)", "release_date"] <- as.Date("2013-08-02")

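The per-song assignments above could also be written as a small lookup table joined by song title. This is a sketch of that alternative, not the method used in the report. The `tracks` data frame, its popularity values, and "Some Other Song" are hypothetical; the two dates are copied from the assignments above:

```r
library(dplyr)

# Hypothetical stand-in for calvin_harris; popularity values are illustrative.
tracks <- data.frame(
  song = c("My Way", "Summer", "Some Other Song"),
  popularity = c(75, 80, 60)
)

# One row per song, with the release dates used in the report.
release_dates <- data.frame(
  song = c("Summer", "My Way"),
  release_date = as.Date(c("2014-03-14", "2016-09-16"))
)

# left_join() keeps every track; songs with no match get NA.
# arrange() then sorts by date, placing NA last.
tracks_dated <- tracks %>%
  left_join(release_dates, by = "song") %>%
  arrange(release_date)

tracks_dated
```

Keeping the dates in one table makes typos in song titles easier to spot, since any unmatched song shows up with an `NA` release date.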
Now that I had cleaned my data, I moved on to creating my visualization. I wanted to create a bubble plot so I could display multiple variables. However, I had to be mindful not to clutter the graph, which would make it difficult to read.

p1<-ggplot(calvin_harris, aes(x = release_date, y = popularity, size = danceability, color = energy)) +
  geom_point(alpha = .8, shape = 19, position = "jitter") + 
  scale_size_continuous(range = c(2, 10)) +  
  scale_color_gradient(low = "red", high = "blue") + 
  labs(title = "Calvin Harris: Popularity over Time with Danceability & Energy",
       x = "Release Date",
       y = "Popularity",
       size = "Danceability",
       color = "Energy") +
  theme_minimal() +
  theme(legend.position="right") +
  guides(shape = "none")
p1