Midterm Presentation

Author

Ike Charistan

Introduction

Project 1: Spotify Songs Analysis

For this project, I’m exploring a dataset of songs from Spotify to see how different musical features relate to a song’s popularity. The dataset includes both numbers (like tempo and popularity score) and categories (like genre and song name), making it great for visual analysis.

I’ll focus on two main questions: Do more danceable or faster songs tend to be more popular? And do certain genres consistently perform better than others? To answer this, I’ll look at relationships between danceability, tempo, popularity, and genre.

Source: Spotify.com

Load the Dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Desktop/Data 110")
songs <- read_csv("spotifysongs.csv")
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl  (1): explicit

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(songs)
# A tibble: 6 × 18
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
1 Britney… Oops…      211160 FALSE     2000         77        0.751  0.834     1
2 blink-1… All …      167066 FALSE     1999         79        0.434  0.897     0
3 Faith H… Brea…      250546 FALSE     1999         66        0.529  0.496     7
4 Bon Jovi It's…      224493 FALSE     2000         78        0.551  0.913     0
5 *NSYNC   Bye …      200560 FALSE     2000         65        0.614  0.928     8
6 Sisqo    Thon…      253733 TRUE      1999         69        0.706  0.888     2
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>

Clean the Data

str(songs)
spc_tbl_ [2,000 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ artist          : chr [1:2000] "Britney Spears" "blink-182" "Faith Hill" "Bon Jovi" ...
 $ song            : chr [1:2000] "Oops!...I Did It Again" "All The Small Things" "Breathe" "It's My Life" ...
 $ duration_ms     : num [1:2000] 211160 167066 250546 224493 200560 ...
 $ explicit        : logi [1:2000] FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ year            : num [1:2000] 2000 1999 1999 2000 2000 ...
 $ popularity      : num [1:2000] 77 79 66 78 65 69 86 68 75 77 ...
 $ danceability    : num [1:2000] 0.751 0.434 0.529 0.551 0.614 0.706 0.949 0.708 0.713 0.72 ...
 $ energy          : num [1:2000] 0.834 0.897 0.496 0.913 0.928 0.888 0.661 0.772 0.678 0.808 ...
 $ key             : num [1:2000] 1 0 7 0 8 2 5 7 5 6 ...
 $ loudness        : num [1:2000] -5.44 -4.92 -9.01 -4.06 -4.81 ...
 $ mode            : num [1:2000] 0 1 1 0 0 1 0 1 0 1 ...
 $ speechiness     : num [1:2000] 0.0437 0.0488 0.029 0.0466 0.0516 0.0654 0.0572 0.0322 0.102 0.0379 ...
 $ acousticness    : num [1:2000] 0.3 0.0103 0.173 0.0263 0.0408 0.119 0.0302 0.0267 0.273 0.00793 ...
 $ instrumentalness: num [1:2000] 1.77e-05 0.00 0.00 1.35e-05 1.04e-03 9.64e-05 0.00 0.00 0.00 2.93e-02 ...
 $ liveness        : num [1:2000] 0.355 0.612 0.251 0.347 0.0845 0.07 0.0454 0.467 0.149 0.0634 ...
 $ valence         : num [1:2000] 0.894 0.684 0.278 0.544 0.879 0.714 0.76 0.861 0.734 0.869 ...
 $ tempo           : num [1:2000] 95.1 148.7 136.9 120 172.7 ...
 $ genre           : chr [1:2000] "pop" "rock, pop" "pop, country" "rock, metal" ...
 - attr(*, "spec")=
  .. cols(
  ..   artist = col_character(),
  ..   song = col_character(),
  ..   duration_ms = col_double(),
  ..   explicit = col_logical(),
  ..   year = col_double(),
  ..   popularity = col_double(),
  ..   danceability = col_double(),
  ..   energy = col_double(),
  ..   key = col_double(),
  ..   loudness = col_double(),
  ..   mode = col_double(),
  ..   speechiness = col_double(),
  ..   acousticness = col_double(),
  ..   instrumentalness = col_double(),
  ..   liveness = col_double(),
  ..   valence = col_double(),
  ..   tempo = col_double(),
  ..   genre = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
dim(songs)
[1] 2000   18
tail(songs)
# A tibble: 6 × 18
  artist   song  duration_ms explicit  year popularity danceability energy   key
  <chr>    <chr>       <dbl> <lgl>    <dbl>      <dbl>        <dbl>  <dbl> <dbl>
1 Post Ma… Good…      174960 TRUE      2019          1        0.58   0.653     5
2 Jonas B… Suck…      181026 FALSE     2019         79        0.842  0.734     1
3 Taylor … Crue…      178426 FALSE     2019         78        0.552  0.702     9
4 Blanco … The …      200593 FALSE     2019         69        0.847  0.678     9
5 Sam Smi… Danc…      171029 FALSE     2019         75        0.741  0.52      8
6 Post Ma… Circ…      215280 FALSE     2019         85        0.695  0.762     0
# ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, genre <chr>
sum(is.na(songs))
[1] 0
any(is.na(songs))
[1] FALSE
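
The two checks above report a single total for the whole table. As a minimal sketch, the same idea can be broken out per column and paired with a range check on popularity, which is the 0 to 100 scale that the later `popularity <= 100` filter assumes. The `songs_demo` tibble and its rows are invented for illustration and are not taken from spotifysongs.csv.

```r
library(tibble)

# Toy rows invented for illustration only; they are not from spotifysongs.csv.
songs_demo <- tibble(
  song = c("a", "b", "c", "d"),
  genre = c("pop", "rock, pop", NA, "hip hop"),
  popularity = c(77, 0, 66, NA)
)

# Count missing values per column instead of one total for the whole table.
na_per_column <- colSums(is.na(songs_demo))
na_per_column

# Smallest and largest popularity, ignoring missing values.
popularity_range <- range(songs_demo$popularity, na.rm = TRUE)
popularity_range
```

Replacing `songs_demo` with `songs` would run the same checks on the real data.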

Which genres have the highest average popularity, and how do their average danceability and tempo compare?

library(dplyr)
songs_summary <- songs |>
  filter(!is.na(popularity), !is.na(genre), !is.na(danceability), !is.na(tempo)) |>
  group_by(genre) |>
  summarize(
    avg_popularity = mean(popularity, na.rm = TRUE),
    avg_danceability = mean(danceability, na.rm = TRUE),
    avg_tempo = mean(tempo, na.rm = TRUE),
    count = n()
  ) |>
  arrange(desc(avg_popularity))
songs_summary
# A tibble: 59 × 5
   genre                         avg_popularity avg_danceability avg_tempo count
   <chr>                                  <dbl>            <dbl>     <dbl> <int>
 1 pop, rock, Folk/Acoustic                79              0.509     108.      2
 2 Folk/Acoustic, pop                      78              0.563     112.      2
 3 rock, pop, metal, Dance/Elec…           76              0.507     105.      1
 4 hip hop, rock, pop                      75              0.526      90.1     2
 5 easy listening                          72              0.417     158.      1
 6 hip hop, latin, Dance/Electr…           72              0.767     172.      1
 7 metal                                   72              0.473     106.      9
 8 rock, Folk/Acoustic, pop                71              0.556      80.5     1
 9 rock, metal                             70.9            0.537     126.     38
10 World/Traditional, Folk/Acou…           69              0.418      82.8     1
# ℹ 49 more rows
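
The summary above treats every combined label, such as "rock, pop", as its own genre, which is why many of the 59 groups contain only one or two songs. One possible alternative, sketched here under assumptions, is to split the label so each song contributes to every genre it lists, using `tidyr::separate_rows()`. The `songs_demo` tibble is invented for illustration, and splitting on commas is an assumption about how the genre column is formatted.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy rows invented for illustration only; they are not from spotifysongs.csv.
songs_demo <- tibble(
  song = c("a", "b", "c", "d"),
  genre = c("pop", "rock, pop", "rock, metal", "pop"),
  popularity = c(80, 70, 60, 90)
)

# Give each song one row per listed genre, then summarize per single genre.
# An explicit separator is passed so labels containing "/" are not split.
genre_split_summary <- songs_demo |>
  separate_rows(genre, sep = ",\\s*") |>
  group_by(genre) |>
  summarize(
    avg_popularity = mean(popularity),
    count = n()
  ) |>
  arrange(desc(avg_popularity))

genre_split_summary
```

With this approach a song tagged with several genres is counted once in each of them, so the group counts no longer add up to the number of songs.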

Data Visualization

How danceability and tempo relate to popularity across genres.

library(ggplot2)
ggplot(songs_summary, aes(x = avg_danceability, y = avg_tempo, color = genre, size = avg_popularity)) +
  geom_point(alpha = 0.8) +
  labs(
    title = "Average Danceability vs. Tempo by Genre",
    subtitle = "Point size shows average popularity; color shows genre",
    x = "Average Danceability",
    y = "Average Tempo (BPM)",
    color = "Genre",
    size = "Avg Popularity",
    caption = "Source: Spotify dataset via Spotify Web API"
  ) +
  scale_color_viridis_d(option = "plasma") +
  theme_minimal()
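
The plot above compares genre averages. The first question, whether more danceable or faster songs tend to be more popular, can also be checked at the level of individual songs. A minimal sketch using base R's `cor()` follows; the `songs_demo` values are invented for illustration, so the resulting numbers say nothing about the real dataset.

```r
library(tibble)

# Toy rows invented for illustration only; they are not from spotifysongs.csv.
songs_demo <- tibble(
  danceability = c(0.40, 0.55, 0.70, 0.85),
  tempo = c(120, 100, 160, 140),
  popularity = c(60, 65, 75, 80)
)

# Pearson correlation of each feature with popularity, song by song.
cor_dance <- cor(songs_demo$danceability, songs_demo$popularity)
cor_tempo <- cor(songs_demo$tempo, songs_demo$popularity)

cor_dance
cor_tempo
```

Values near 0 would indicate little linear relationship, while values near 1 or -1 would indicate a strong one. Replacing `songs_demo` with `songs` would apply the same check to the real data.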

Average Popularity by Genre

library(ggplot2)
genre_popularity <- songs |>
  filter(
    !is.na(genre),
    !is.na(popularity),
    popularity <= 100
  ) |>
  group_by(genre) |>
  summarize(
    avg_popularity = mean(popularity, na.rm = TRUE),
    count = n()
  ) |>
  filter(count >= 10) |>  # Keep only genres with enough data
  arrange(desc(avg_popularity))

ggplot(genre_popularity, aes(x = reorder(genre, avg_popularity), y = avg_popularity, fill = genre)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +  # Flip for better readability
  labs(
    title = "Average Popularity by Genre",
    x = "Genre",
    y = "Average Popularity",
    caption = "Source: Spotify dataset via Spotify Web API"
  ) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal()

Conclusion

Data Cleaning

To prepare the dataset, I filtered out rows with missing values in genre or popularity (the earlier check found none) and any songs with a popularity score above 100. I then grouped the data by genre and used summarize() to calculate the average popularity of each genre. I kept only genres with at least 10 songs so that each average would rest on enough data to be reliable.

What the Visualization Shows

The bar plot shows the average popularity of songs in each genre, which makes it easy to compare which genres tend to perform better on Spotify. For example, genres like Dance and Pop Rap had some of the highest average popularity scores, while genres like Classical and Folk scored lower. This suggests that upbeat or modern genres tend to attract more streams and listeners on the platform.