Load necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Load the dataset

spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")

Display first few rows

head(spotify_songs)
##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049

1. Referring to the documentation to identify unclear columns

Extracting the relevant columns to investigate

spotify_subset <- spotify_songs %>%
  select(track_popularity, playlist_genre, key)

Display the first few rows of the selected subset

head(spotify_subset)
##   track_popularity playlist_genre key
## 1               66            pop   6
## 2               67            pop  11
## 3               70            pop   1
## 4               60            pop   7
## 5               69            pop   1
## 6               67            pop   8

Unclear Columns Identified

Column 1: track_popularity

Description:

According to the documentation, track_popularity is an integer ranging from 0 to 100, where higher values signify greater popularity. The score is dynamically calculated based on factors such as the track’s recent streams, listener engagement, and inclusion in popular playlists.

Reason for Encoding:

Spotify uses this encoding because it gives a standardized way to compare tracks across the platform, regardless of genre, artist, or release date. A relative popularity score supports quick ranking, recommendation algorithms, and personalized user experiences, and lets Spotify adapt to evolving listener preferences in real time.

Implication if Not Documented:

Without reading the documentation, one might assume that track_popularity directly corresponds to the total number of streams, which isn’t the case. The score isn’t static and can change frequently based on recent user interactions, playlist features, and trends. This misunderstanding could lead to flawed analysis when measuring a track’s long-term success or comparing older tracks to newer ones, as older tracks might have fewer recent plays but higher lifetime streams.

Column 2: playlist_genre

Description:

The playlist_genre column labels each track with the genre of the playlist it appears in. The label is broad: it groups tracks that share similar musical styles, moods, or activities rather than giving a precise genre classification of the track itself.

Reason for Encoding:

By encoding playlist_genre, Spotify enables users and analysts to gain insights into how tracks are typically consumed within different contexts. For instance, tracks placed in a “Workout” or “Chill” playlist might cross genre boundaries but share similar energy levels or tempos. This helps Spotify deliver more personalized playlist recommendations and conduct genre-based analytics.

Implication if Not Documented:

If the documentation is overlooked, one might incorrectly treat playlist_genre as the track’s inherent genre, leading to misclassifications. For example, a jazz track placed in a “Chill” playlist might be wrongly analyzed as chill music instead of its actual genre, resulting in misleading conclusions about genre popularity, listener preferences, or even influencing recommendation models incorrectly.
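One way to test this is to count how many distinct playlist_genre values each track_id carries. The sketch below uses a small invented data frame (toy_songs is a stand-in, not the real data) and base R so it runs on its own. With the actual dataset, spotify_songs would take its place.

```r
# Toy stand-in for spotify_songs; these rows are invented for illustration.
toy_songs <- data.frame(
  track_id = c("a1", "a1", "b2", "c3", "c3"),
  playlist_genre = c("pop", "edm", "rock", "pop", "pop"),
  stringsAsFactors = FALSE
)

# Count the distinct playlist genres attached to each track_id.
genres_per_track <- tapply(
  toy_songs$playlist_genre,
  toy_songs$track_id,
  function(g) length(unique(g))
)

# Tracks that appear under more than one playlist genre.
multi_genre_tracks <- names(genres_per_track)[genres_per_track > 1]
multi_genre_tracks
```

If any track turns up under more than one genre in the real data, that would support reading playlist_genre as a property of the playlist rather than of the track.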

Column 3: key

Description:

The key column uses integers from 0 to 11 to represent the tonic of a track in pitch-class notation. For instance, 0 represents C, 1 represents C♯/D♭, and so forth up to 11, which represents B. The integer alone does not say whether the track is in a major or minor key; that is recorded separately in the mode column (1 for major, 0 for minor). This encoding is common in music theory and digital audio processing.

Reason for Encoding:

This numerical representation simplifies data storage and processing for analytical tasks. Music theory applications, such as chord progressions and key analysis, often use this standardized encoding, making it easier to identify patterns, analyze harmonic content, and compare tracks across different genres and styles.

Implication if Not Documented:

Without documentation, users might treat this column as an ordinary numeric quantity or overlook its musical significance, which causes confusion in harmonic analysis. For instance, reading ‘5’ as a magnitude instead of as the pitch class F (and averaging or ranking it accordingly) could lead to erroneous conclusions in genre analysis or in recommendations based on harmonic compatibility.
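To make the encoding concrete, here is a minimal sketch that turns key and mode into a readable label. The helper key_label() is hypothetical and was not part of the original analysis. It assumes that mode is coded 1 for major and 0 for minor, and it returns NA for any key outside 0 to 11.

```r
# Pitch-class names for key values 0..11 (vector index 1 corresponds to key 0).
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# key_label() is a hypothetical helper for illustration only.
# key:  integer pitch class, expected in 0..11
# mode: 1 = major, 0 = minor (assumed coding of the mode column)
key_label <- function(key, mode) {
  valid <- key %in% 0:11 & mode %in% c(0, 1)
  idx <- ifelse(valid, key, 0) + 1          # safe index even for invalid rows
  label <- paste(pitch_classes[idx], ifelse(mode == 1, "major", "minor"))
  label[!valid] <- NA_character_            # blank out anything unrecognized
  label
}

# Key/mode pairs of rows 1-3 in the head() output: (6, 1), (11, 1), (1, 0).
key_label(c(6, 11, 1), c(1, 1, 0))
```

Storing the result as a character or factor column keeps later summaries from treating key as a continuous number.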

2. Investigating the ‘tempo’ column for issues not fully explained in the documentation

summary(spotify_songs$tempo)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   99.96  121.98  120.88  133.92  239.44

Checking for any unexpected values in the ‘tempo’ column

spotify_songs %>% 
  filter(tempo < 0 | tempo > 250) %>%
  select(track_name, tempo)
## [1] track_name tempo     
## <0 rows> (or 0-length row.names)
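The check above returns no rows because its bounds sit outside the observed range: the summary shows a minimum of 0.00 and a maximum of 239.44. A tighter filter would surface the zero-tempo rows and the very fast ones. The sketch below runs on an invented toy data frame, and the 40 and 200 BPM cutoffs are illustrative assumptions rather than documented limits.

```r
# Toy stand-in for spotify_songs; track names and tempos are invented.
toy_songs <- data.frame(
  track_name = c("t1", "t2", "t3", "t4"),
  tempo = c(0, 121.98, 35.5, 210.2),
  stringsAsFactors = FALSE
)

# Keep rows whose tempo is at or below 40 BPM or at or above 200 BPM.
# Both cutoffs are assumptions chosen for illustration.
suspect_tempo <- subset(toy_songs, tempo <= 40 | tempo >= 200)
suspect_tempo
```

Using inclusive comparisons means a tempo of exactly 0 is caught, which the strict `tempo < 0` test cannot do.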

Unclear Element Even After Reading the Documentation

Element: tempo

Unclear Aspect:

The tempo column represents the speed of a track in beats per minute (BPM). While the documentation identifies the unit as BPM, it does not explain how Spotify handles tracks whose tempo changes or whose time signature varies within the song.

Why It’s Unclear:

Some tracks have tempo shifts or aren’t strictly bound to a single BPM. It’s unclear whether Spotify uses an average tempo, a specific section, or if it’s measured differently when significant variations occur.

Further Investigation:

Examining tracks with extreme tempo values can expose discrepancies or show how tempo is actually represented. The summary above already reports a minimum of 0.00 BPM, which cannot be a real musical tempo, and a maximum of 239.44 BPM.

Impact:

Misinterpretation of tempo data could affect analyses related to track energy, genre classification, or danceability, especially for genres with flexible tempo structures (e.g., jazz, progressive rock). This example highlights the importance of cross-checking values beyond relying solely on documentation to ensure accurate interpretation.

Visualizing the distribution of tempo values

ggplot(spotify_songs, aes(x = tempo)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = c(60, 120, 180), linetype = "dashed", color = "orange", linewidth = 0.8) +
  labs(title = "Distribution of Track Tempo (BPM)",
       x = "Tempo (Beats Per Minute)",
       y = "Number of Tracks") +
  theme_minimal() +
  annotate("text", x = 200, y = 400, label = "Unusually high tempo values", color = "orange", size = 4, angle = 0) +
  annotate("text", x = 30, y = 350, label = "Unusually low tempo values", color = "orange", size = 4, angle = 0)

The histogram visualizes the distribution of track tempo (BPM) across the dataset. The vertical dashed orange lines at 60, 120, and 180 BPM mark reference points within the tempo range typical of most music genres.

What’s Unclear and Why?

  1. The documentation indicates that tempo represents beats per minute, but some tracks have unusually high or low tempo values (e.g., above 200 BPM or below 40 BPM). These outliers raise questions about how Spotify calculates tempo for tracks with tempo shifts or variable time signatures.

  2. It’s unclear if these extreme values result from inaccuracies, unique genre characteristics, or misinterpretation of tempo.

Significant Risks and Mitigation

Risk:

Using unverified tempo values might lead to incorrect genre classification or misjudgments about track energy levels.

Mitigation:

To reduce negative consequences, apply additional filtering or smoothing techniques to identify and correct outliers, or combine tempo data with other features (e.g., energy, danceability) for more reliable analysis. Additionally, manual verification of a sample of tracks with extreme tempo values can provide insights into potential data inconsistencies.
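As one possible way to identify outliers, the sketch below flags tempos outside the conventional 1.5 × IQR fences. The helper flag_tempo_outliers() and the toy vector are hypothetical, and the IQR rule is a swapped-in convention rather than something the documentation prescribes. With the quartiles reported earlier (99.96 and 133.92), the fences would fall near 49.02 and 184.86 BPM.

```r
# flag_tempo_outliers() is a hypothetical helper for illustration only.
# It returns TRUE for values outside [Q1 - k * IQR, Q3 + k * IQR].
flag_tempo_outliers <- function(tempo, k = 1.5) {
  q <- quantile(tempo, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  tempo < q[1] - k * iqr | tempo > q[2] + k * iqr
}

# Invented tempos: one zero, one very fast value, the rest mid-range.
toy_tempo <- c(0, 95, 100, 110, 120, 122, 125, 130, 135, 140, 239)
outlier_flags <- flag_tempo_outliers(toy_tempo)
toy_tempo[outlier_flags]
```

Flagging rather than deleting keeps the rows available for the manual verification step, so a genuinely fast or slow track is not discarded by mistake.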

1. Insight from Unclear Columns Identified

Columns: track_popularity, playlist_genre, key

Insight Gathered:

  1. track_popularity: This column provides a relative measure of a track’s success, which is significant for understanding trends in music popularity. It’s essential for building recommendation systems and identifying trending songs.

  2. playlist_genre: This column helps categorize tracks based on the playlist rather than the song’s genre, offering insight into listener behavior and how certain tracks are associated with different moods or activities.

  3. key: The musical key provides harmonic information, which can be crucial for understanding a track’s composition and identifying patterns in different genres.

Significance:

Properly interpreting these columns ensures accurate analysis in genre classification, trend prediction, and music recommendation systems.

Further Questions:

  1. How often do the track_popularity values change, and how does Spotify update them over time?
  2. Does the playlist_genre reflect listener preferences more accurately than traditional genre labels?
  3. Are certain musical keys more prevalent in specific genres or playlists?
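The third question can be approached with a simple cross-tabulation of genre against key. The sketch below uses an invented toy data frame to show the shape of the computation. On the real data, spotify_songs$playlist_genre and spotify_songs$key would be the inputs.

```r
# Toy stand-in for spotify_songs; the genre/key pairs are invented.
toy_songs <- data.frame(
  playlist_genre = c("pop", "pop", "pop", "rock", "rock", "edm"),
  key = c(0, 0, 7, 9, 9, 1),
  stringsAsFactors = FALSE
)

# Counts of each key within each playlist genre.
key_by_genre <- table(toy_songs$playlist_genre, toy_songs$key)

# Row-wise proportions: the share of each key inside a genre.
key_share <- prop.table(key_by_genre, margin = 1)
round(key_share, 2)
```

Comparing shares rather than raw counts keeps genres with many tracks from dominating the comparison.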

2. Insight from the Unclear Element Even After Reading the Documentation

Element: tempo

Insight Gathered:

The tempo column, although representing BPM, may be inconsistent due to songs with tempo shifts or variable time signatures. It raises concerns about whether it’s an average, a section-based measure, or calculated differently.

Significance:

Understanding the true nature of tempo is crucial, as it’s often used to predict energy levels, danceability, or genre classification. Misinterpretation could skew results, especially for genres with complex rhythms.

Further Questions:

  1. How does Spotify handle tracks with varying tempo? Are there other attributes that capture these shifts?
  2. Would combining tempo with another metric (e.g., energy) provide a more comprehensive understanding of a song’s rhythm?

3. Insight from the Visualization

Insight Gathered:

The histogram shows that most tracks fall within a typical tempo range (60-180 BPM), but there are some extreme outliers. These unusually high or low values suggest that tempo might not always be consistent or accurately measured for certain tracks.

Significance:

Recognizing these outliers is crucial because they can significantly impact analyses related to energy, danceability, or genre classification. Misinterpreting them might lead to incorrect insights about music trends.

Further Questions:

  1. Are the extreme tempo values accurate reflections of the tracks, or are they due to data recording errors?
  2. Could incorporating other rhythmic attributes help to validate or adjust these tempo values?