library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")
head(spotify_songs)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
spotify_subset <- spotify_songs %>%
  select(track_popularity, playlist_genre, key)
head(spotify_subset)
## track_popularity playlist_genre key
## 1 66 pop 6
## 2 67 pop 11
## 3 70 pop 1
## 4 60 pop 7
## 5 69 pop 1
## 6 67 pop 8
According to the documentation, track_popularity is an integer ranging from 0 to 100, where higher values signify greater popularity. The score is dynamically calculated based on factors such as the track's recent streams, listener engagement, and inclusion in popular playlists.
Spotify uses this encoding because it provides a standardized way of comparing tracks across the platform, regardless of genre, artist, or release date. This relative popularity score facilitates quick ranking, recommendation algorithms, and personalized user experiences, enabling Spotify to adapt to evolving listener preferences in real time.
Without reading the documentation, one might assume that track_popularity directly corresponds to the total number of streams, which isn't the case. The score isn't static and can change frequently based on recent user interactions, playlist features, and trends. This misunderstanding could lead to flawed analysis when measuring a track's long-term success or comparing older tracks to newer ones, as older tracks might have fewer recent plays but higher lifetime streams.
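A quick sanity check confirms whether the values actually stay within the documented 0 to 100 range (output not shown; the exact figures depend on the dataset snapshot):
summary(spotify_songs$track_popularity)
# Verify the documented 0-100 bounds hold in this snapshot
range(spotify_songs$track_popularity)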
The playlist_genre column assigns a genre to each track based on the playlist it is most frequently associated with. This label is broader and serves as a way to group tracks that belong to similar musical styles, moods, or activities rather than being a precise genre classification of the track itself.
By encoding playlist_genre, Spotify enables users and analysts to gain insights into how tracks are typically consumed within different contexts. For instance, tracks placed in a "Workout" or "Chill" playlist might cross genre boundaries but share similar energy levels or tempos. This helps Spotify deliver more personalized playlist recommendations and conduct genre-based analytics.
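Tabulating the labels shows how coarse this grouping is (counts omitted here; they depend on the dataset version):
# Count tracks per playlist-level genre label
spotify_songs %>%
  count(playlist_genre, sort = TRUE)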
If the documentation is overlooked, one might incorrectly treat playlist_genre as the track's inherent genre, leading to misclassifications. For example, a jazz track placed in a "Chill" playlist might be wrongly analyzed as chill music instead of its actual genre, resulting in misleading conclusions about genre popularity and listener preferences, or skewing recommendation models.
The key column uses integers from 0 to 11 to represent musical keys in standard pitch-class notation, where each number corresponds to a specific pitch class: 0 represents C, 1 represents C#/Db, and so forth up to 11, which represents B. Whether a track is major or minor is stored separately in the mode column (1 for major, 0 for minor), and in the Spotify API a key value of -1 indicates that no key was detected. This encoding method is common in music theory and digital audio processing.
This numerical representation simplifies data storage and processing for analytical tasks. Music theory applications, such as chord progressions and key analysis, often use this standardized encoding, making it easier to identify patterns, analyze harmonic content, and compare tracks across different genres and styles.
Without documentation, users might misinterpret this column as an arbitrary numerical value or overlook its musical significance, leading to confusion when attempting harmonic analysis. For instance, treating '5' as a literal quantity instead of recognizing that it represents the pitch class F might lead to erroneous conclusions in genre analysis or in recommendations based on harmonic compatibility.
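A minimal sketch of decoding the column (key_names is an illustrative lookup vector, not a field in the dataset):
# Map pitch-class integers 0-11 to note names; match() returns NA
# for any undetected key coded as -1
key_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
               "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
spotify_songs <- spotify_songs %>%
  mutate(key_name = key_names[match(key, 0:11)])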
summary(spotify_songs$tempo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 99.96 121.98 120.88 133.92 239.44
spotify_songs %>%
  filter(tempo < 0 | tempo > 250) %>%
  select(track_name, tempo)
## [1] track_name tempo
## <0 rows> (or 0-length row.names)
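Note that although this filter returns no rows, the summary above reports a minimum of 0.00 BPM, which the condition tempo < 0 does not catch. Checking for exact zeros directly surfaces those rows (output omitted; the count depends on the data):
# A tempo of exactly 0 BPM is almost certainly a data artifact,
# not a real measurement
spotify_songs %>%
  filter(tempo == 0) %>%
  select(track_name, track_artist, tempo)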
The tempo column represents the speed of a track in beats per minute (BPM). While the documentation describes it as BPM, there is no further explanation of how Spotify handles tracks with inconsistent tempo changes or variations in time signatures throughout the song.
Some tracks have tempo shifts or aren't strictly bound to a single BPM. It's unclear whether Spotify uses an average tempo, a specific section, or a different measure when significant variations occur.
By examining tracks with extreme tempo values (e.g., very high or low BPM), one might find discrepancies or insights into how tempo is represented.
Misinterpretation of tempo data could affect analyses related to track energy, genre classification, or danceability, especially for genres with flexible tempo structures (e.g., jazz, progressive rock). This example highlights the importance of cross-checking values rather than relying solely on documentation to ensure accurate interpretation.
ggplot(spotify_songs, aes(x = tempo)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = c(60, 120, 180), linetype = "dashed", color = "orange", linewidth = 0.8) +
  labs(title = "Distribution of Track Tempo (BPM)",
       x = "Tempo (Beats Per Minute)",
       y = "Number of Tracks") +
  theme_minimal() +
  annotate("text", x = 200, y = 400, label = "Unusually high tempo values", color = "orange", size = 4) +
  annotate("text", x = 30, y = 350, label = "Unusually low tempo values", color = "orange", size = 4)
The histogram visualizes the distribution of track tempo (BPM) across the dataset. The vertical dashed orange lines at 60, 120, and 180 BPM mark common reference tempos found across most music genres.
The documentation indicates that tempo represents beats per minute, but some tracks have unusually high or low tempo values (e.g., above 200 BPM or below 40 BPM). These outliers raise questions about how Spotify calculates tempo for tracks with tempo shifts or variable signatures.
It’s unclear if these extreme values result from inaccuracies, unique genre characteristics, or misinterpretation of tempo.
Using unverified tempo values might lead to incorrect genre classification or misjudgments about track energy levels.
To reduce negative consequences, apply additional filtering or smoothing techniques to identify and correct outliers (one such approach is sketched below), or combine tempo data with other features (e.g., energy, danceability) for more reliable analysis. Additionally, manually verifying a sample of tracks with extreme tempo values can provide insight into potential data inconsistencies.
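For example, here is a minimal sketch that flags suspect tempos for review rather than silently dropping them (the 40 and 220 BPM cutoffs are illustrative choices, not documented thresholds):
# Flag tracks outside an illustrative plausible-tempo window
spotify_songs %>%
  mutate(tempo_suspect = tempo < 40 | tempo > 220) %>%
  filter(tempo_suspect) %>%
  select(track_name, playlist_genre, tempo) %>%
  head()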
track_popularity: This column provides a relative measure of a track’s success, which is significant for understanding trends in music popularity. It’s essential for building recommendation systems and identifying trending songs.
playlist_genre: This column helps categorize tracks based on the playlist rather than the song’s genre, offering insight into listener behavior and how certain tracks are associated with different moods or activities.
key: The musical key provides harmonic information, which can be crucial for understanding a track’s composition and identifying patterns in different genres.
Properly interpreting these columns ensures accurate analysis in genre classification, trend prediction, and music recommendation systems.
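As a small illustration of combining these columns (a sketch; the resulting figures depend on the data), one can compare average popularity across playlist-level genres:
# Average track popularity by playlist-level genre
spotify_subset %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity), n = n())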
The tempo column, although representing BPM, may be inconsistent due to songs with tempo shifts or variable time signatures. It raises concerns about whether it’s an average, a section-based measure, or calculated differently.
Understanding the true nature of tempo is crucial, as it’s often used to predict energy levels, danceability, or genre classification. Misinterpretation could skew results, especially for genres with complex rhythms.
The histogram shows that most tracks fall within a typical tempo range (60-180 BPM), but there are some extreme outliers. These unusually high or low values suggest that tempo might not always be consistent or accurately measured for certain tracks.
Recognizing these outliers is crucial because they can significantly impact analyses related to energy, danceability, or genre classification. Misinterpreting them might lead to incorrect insights about music trends.