PROJECT OVERVIEW: This project explores the “Most Streamed Spotify Songs of 2024” dataset to uncover trends and patterns in music streaming across platforms such as Spotify, YouTube, TikTok, and SoundCloud. Because music consumption is changing quickly, the dataset helps show which songs and artists dominate across platforms and which factors contribute most to their popularity. The dataset includes detailed metrics such as stream counts, playlist reach, social engagement, and an all-time rank. It also includes metadata such as track and album names, artists, release dates, and ISRC codes.
Objectives:
1. Data Cleaning and Preprocessing: To systematically clean and preprocess the dataset by addressing missing values, removing duplicate entries, and ensuring that all data types are correctly formatted for accurate analysis.
2. Exploratory Data Analysis: To explore and analyze key streaming metrics, including Spotify streams, YouTube views, and TikTok engagement, in order to identify patterns, trends, and insights related to song performance.
3. Artist and Record Label Evaluation: To assess the influence of artists and record labels by identifying the most successful contributors based on streaming volume, social media presence, and overall impact on the music landscape.
4. Cross-Platform Performance Comparison: To compare performance metrics across multiple platforms and investigate the relationships and correlations between them, particularly focusing on the connection between YouTube views and Spotify streams.
5. Identification of Top and Consistent Performers: To identify top-ranking songs and those that consistently trend across different platforms over time, highlighting tracks that maintain popularity and engagement.
6. Aggregation and Summarization of Streaming Data: To aggregate and summarize streaming data by artist and platform, providing a comprehensive understanding of audience engagement, content reach, and market dynamics.
7. Feature Engineering for Enhanced Insights: To create meaningful new features, such as composite popularity scores, engagement ratios, and platform diversity metrics, thereby enriching the dataset for deeper insights and advanced analytical modeling.
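# Loading the data set
# Packages used throughout: readr for read_csv(), dplyr for %>% and the data verbs, ggplot2 for the plots
library(readr)
library(dplyr)
library(ggplot2)
data <- read_csv("C:/Users/ACER/Desktop/Most Streamed Spotify Songs 2024.csv")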
## Rows: 4600 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Track, Album Name, Artist, Release Date, ISRC
## dbl (6): Track Score, Spotify Popularity, Apple Music Playlist Count, Deeze...
## num (17): All Time Rank, Spotify Streams, Spotify Playlist Count, Spotify Pl...
## lgl (1): TIDAL Popularity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 1. Understanding the data set
# Remove columns that contain all NA values
data <- data[, colSums(is.na(data)) < nrow(data)]
# Fill NA values in numeric columns with their respective mean values
numeric_columns <- names(data)[sapply(data, is.numeric)]
for (col in numeric_columns) {
  data[[col]][is.na(data[[col]])] <- mean(data[[col]], na.rm = TRUE)
}
# Check if missing values have been handled
missing_values <- colSums(is.na(data))
print("Missing values per column after handling:")
## [1] "Missing values per column after handling:"
print(missing_values)
## Track Album Name
## 0 0
## Artist Release Date
## 5 0
## ISRC All Time Rank
## 0 0
## Track Score Spotify Streams
## 0 0
## Spotify Playlist Count Spotify Playlist Reach
## 0 0
## Spotify Popularity YouTube Views
## 0 0
## YouTube Likes TikTok Posts
## 0 0
## TikTok Likes TikTok Views
## 0 0
## YouTube Playlist Reach Apple Music Playlist Count
## 0 0
## AirPlay Spins SiriusXM Spins
## 0 0
## Deezer Playlist Count Deezer Playlist Reach
## 0 0
## Amazon Playlist Count Pandora Streams
## 0 0
## Pandora Track Stations Soundcloud Streams
## 0 0
## Shazam Counts Explicit Track
## 0 0
#a.Column names are very important as they provide context and meaning to the data contained within each column.
print(colnames(data))
## [1] "Track" "Album Name"
## [3] "Artist" "Release Date"
## [5] "ISRC" "All Time Rank"
## [7] "Track Score" "Spotify Streams"
## [9] "Spotify Playlist Count" "Spotify Playlist Reach"
## [11] "Spotify Popularity" "YouTube Views"
## [13] "YouTube Likes" "TikTok Posts"
## [15] "TikTok Likes" "TikTok Views"
## [17] "YouTube Playlist Reach" "Apple Music Playlist Count"
## [19] "AirPlay Spins" "SiriusXM Spins"
## [21] "Deezer Playlist Count" "Deezer Playlist Reach"
## [23] "Amazon Playlist Count" "Pandora Streams"
## [25] "Pandora Track Stations" "Soundcloud Streams"
## [27] "Shazam Counts" "Explicit Track"
Interpretation: The dataset contains multiple metrics measuring a song’s performance across different platforms, along with metadata such as track name, album name, artist, release date, and ISRC. It does not contain genre or record label columns.
#b.Are there any missing or duplicate values in the dataset?
# Check for missing values
missing_values <- colSums(is.na(data))
print(missing_values)
## Track Album Name
## 0 0
## Artist Release Date
## 5 0
## ISRC All Time Rank
## 0 0
## Track Score Spotify Streams
## 0 0
## Spotify Playlist Count Spotify Playlist Reach
## 0 0
## Spotify Popularity YouTube Views
## 0 0
## YouTube Likes TikTok Posts
## 0 0
## TikTok Likes TikTok Views
## 0 0
## YouTube Playlist Reach Apple Music Playlist Count
## 0 0
## AirPlay Spins SiriusXM Spins
## 0 0
## Deezer Playlist Count Deezer Playlist Reach
## 0 0
## Amazon Playlist Count Pandora Streams
## 0 0
## Pandora Track Stations Soundcloud Streams
## 0 0
## Shazam Counts Explicit Track
## 0 0
# Check for duplicate rows
duplicate_rows <- sum(duplicated(data))
print(paste("Number of duplicate rows:", duplicate_rows))
## [1] "Number of duplicate rows: 2"
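Interpretation: Missing values in numeric columns were replaced with column means, but 5 missing values remain in the Artist column. Two duplicate rows were identified and have not yet been removed; they should be dropped before further analysis so that no track is counted twice.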
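The duplicate rows counted above can be dropped before further analysis. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative and not taken from the dataset), using base R's `duplicated()`:

```r
# Toy data standing in for the real tibble; all values are made up
toy <- data.frame(
  Track   = c("A", "B", "B", "C"),
  Artist  = c("X", "Y", "Y", NA),
  Streams = c(10, 20, 20, 30),
  stringsAsFactors = FALSE
)

# Count fully duplicated rows (the second "B" row repeats the first)
n_dup <- sum(duplicated(toy))

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]

# Rows with a missing Artist cannot be attributed in per-artist summaries;
# whether to drop or keep them is an analysis choice
toy_known_artist <- toy_unique[!is.na(toy_unique$Artist), ]

print(n_dup)
print(nrow(toy_unique))
print(nrow(toy_known_artist))
```

On the real tibble, the equivalent step would be `data <- data[!duplicated(data), ]` or `dplyr::distinct(data)`.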
#c.Convert the date column to Date format (assuming the column is named 'date')
if ("date" %in% colnames(data)) {
  data$date <- as.Date(data$date, format="%Y-%m-%d")
  # Find the time range
  time_range <- range(data$date, na.rm = TRUE)
  print(time_range)
} else {
  print("No 'date' column found in the dataset.")
}
## [1] "No 'date' column found in the dataset."
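Interpretation: The check found no column named 'date', so no time range was computed. The release date is stored in the “Release Date” column as character strings in month/day/year form (for example "4/26/2024"), so this step needs to be revisited with that column name and format before any trend analysis.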
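The `str(data)` output in this report shows release dates in a `Release Date` column as strings such as "4/26/2024". A minimal sketch of parsing that format, using the four sample values visible in that output (the variable names here are illustrative):

```r
# Sample values copied from the `Release Date` column shown by str(data)
release_date <- c("4/26/2024", "5/4/2024", "3/19/2024", "1/12/2023")

# The strings are month/day/four-digit year, so the format is "%m/%d/%Y"
parsed <- as.Date(release_date, format = "%m/%d/%Y")

# Earliest and latest release dates among the parsed values
time_range <- range(parsed, na.rm = TRUE)
print(time_range)
```

Applied to the full column, this would be `as.Date(data$`Release Date`, format = "%m/%d/%Y")`; any value that does not match the format becomes `NA` and should be inspected rather than silently dropped.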
#d. Display data types and structure of the dataset
print("Dataset structure:")
## [1] "Dataset structure:"
str(data)
## tibble [4,600 × 28] (S3: tbl_df/tbl/data.frame)
## $ Track : chr [1:4600] "MILLION DOLLAR BABY" "Not Like Us" "i like the way you kiss me" "Flowers" ...
## $ Album Name : chr [1:4600] "Million Dollar Baby - Single" "Not Like Us" "I like the way you kiss me" "Flowers - Single" ...
## $ Artist : chr [1:4600] "Tommy Richman" "Kendrick Lamar" "Artemas" "Miley Cyrus" ...
## $ Release Date : chr [1:4600] "4/26/2024" "5/4/2024" "3/19/2024" "1/12/2023" ...
## $ ISRC : chr [1:4600] "QM24S2402528" "USUG12400910" "QZJ842400387" "USSM12209777" ...
## $ All Time Rank : num [1:4600] 1 2 3 4 5 6 7 8 9 10 ...
## $ Track Score : num [1:4600] 725 546 538 445 423 ...
## $ Spotify Streams : num [1:4600] 3.90e+08 3.24e+08 6.01e+08 2.03e+09 1.07e+08 ...
## $ Spotify Playlist Count : num [1:4600] 30716 28113 54331 269802 7223 ...
## $ Spotify Playlist Reach : num [1:4600] 1.97e+08 1.75e+08 2.12e+08 1.37e+08 1.51e+08 ...
## $ Spotify Popularity : num [1:4600] 92 92 92 85 88 ...
## $ YouTube Views : num [1:4600] 8.43e+07 1.16e+08 1.23e+08 1.10e+09 7.74e+07 ...
## $ YouTube Likes : num [1:4600] 1713126 3486739 2228730 10629796 3670188 ...
## $ TikTok Posts : num [1:4600] 5767700 674700 3025400 7189811 16400 ...
## $ TikTok Likes : num [1:4600] 6.52e+08 3.52e+07 2.75e+08 1.08e+09 1.13e+08 ...
## $ TikTok Views : num [1:4600] 5.33e+09 2.08e+08 3.37e+09 1.46e+10 1.16e+09 ...
## $ YouTube Playlist Reach : num [1:4600] 1.51e+08 1.56e+08 3.74e+08 3.35e+09 1.13e+08 ...
## $ Apple Music Playlist Count: num [1:4600] 210 188 190 394 182 ...
## $ AirPlay Spins : num [1:4600] 40975 40778 74333 1474799 12185 ...
## $ SiriusXM Spins : num [1:4600] 684 3 536 2182 1 ...
## $ Deezer Playlist Count : num [1:4600] 62 67 136 264 82 ...
## $ Deezer Playlist Reach : num [1:4600] 17598718 10422430 36321847 24684248 17660624 ...
## $ Amazon Playlist Count : num [1:4600] 114 111 172 210 105 ...
## $ Pandora Streams : num [1:4600] 1.80e+07 7.78e+06 5.02e+06 1.90e+08 4.49e+06 ...
## $ Pandora Track Stations : num [1:4600] 22931 28444 5639 203384 7006 ...
## $ Soundcloud Streams : num [1:4600] 4818457 6623075 7208651 14847968 207179 ...
## $ Shazam Counts : num [1:4600] 2669262 1118279 5285340 11822942 457017 ...
## $ Explicit Track : num [1:4600] 0 1 0 0 1 1 0 1 1 1 ...
# Identify and apply necessary data transformations
print("Applying necessary transformations...")
## [1] "Applying necessary transformations..."
# Convert character columns that should be categorical (factor)
categorical_columns <- c("artist", "genre", "region", "label")
for (col in categorical_columns) {
  if (col %in% colnames(data)) {
    data[[col]] <- as.factor(data[[col]])
    print(paste("Converted", col, "to factor (categorical)."))
  }
}
# Convert numeric columns stored as characters to numeric
numeric_columns <- c("streams", "daily_streams", "social_mentions", "artist_popularity")
for (col in numeric_columns) {
  if (col %in% colnames(data)) {
    data[[col]] <- as.numeric(gsub(",", "", data[[col]])) # Remove commas before conversion
    print(paste("Converted", col, "to numeric."))
  }
}
# Final check on dataset structure after transformations
print("Updated dataset structure:")
## [1] "Updated dataset structure:"
str(data)
## tibble [4,600 × 28] (S3: tbl_df/tbl/data.frame)
## $ Track : chr [1:4600] "MILLION DOLLAR BABY" "Not Like Us" "i like the way you kiss me" "Flowers" ...
## $ Album Name : chr [1:4600] "Million Dollar Baby - Single" "Not Like Us" "I like the way you kiss me" "Flowers - Single" ...
## $ Artist : chr [1:4600] "Tommy Richman" "Kendrick Lamar" "Artemas" "Miley Cyrus" ...
## $ Release Date : chr [1:4600] "4/26/2024" "5/4/2024" "3/19/2024" "1/12/2023" ...
## $ ISRC : chr [1:4600] "QM24S2402528" "USUG12400910" "QZJ842400387" "USSM12209777" ...
## $ All Time Rank : num [1:4600] 1 2 3 4 5 6 7 8 9 10 ...
## $ Track Score : num [1:4600] 725 546 538 445 423 ...
## $ Spotify Streams : num [1:4600] 3.90e+08 3.24e+08 6.01e+08 2.03e+09 1.07e+08 ...
## $ Spotify Playlist Count : num [1:4600] 30716 28113 54331 269802 7223 ...
## $ Spotify Playlist Reach : num [1:4600] 1.97e+08 1.75e+08 2.12e+08 1.37e+08 1.51e+08 ...
## $ Spotify Popularity : num [1:4600] 92 92 92 85 88 ...
## $ YouTube Views : num [1:4600] 8.43e+07 1.16e+08 1.23e+08 1.10e+09 7.74e+07 ...
## $ YouTube Likes : num [1:4600] 1713126 3486739 2228730 10629796 3670188 ...
## $ TikTok Posts : num [1:4600] 5767700 674700 3025400 7189811 16400 ...
## $ TikTok Likes : num [1:4600] 6.52e+08 3.52e+07 2.75e+08 1.08e+09 1.13e+08 ...
## $ TikTok Views : num [1:4600] 5.33e+09 2.08e+08 3.37e+09 1.46e+10 1.16e+09 ...
## $ YouTube Playlist Reach : num [1:4600] 1.51e+08 1.56e+08 3.74e+08 3.35e+09 1.13e+08 ...
## $ Apple Music Playlist Count: num [1:4600] 210 188 190 394 182 ...
## $ AirPlay Spins : num [1:4600] 40975 40778 74333 1474799 12185 ...
## $ SiriusXM Spins : num [1:4600] 684 3 536 2182 1 ...
## $ Deezer Playlist Count : num [1:4600] 62 67 136 264 82 ...
## $ Deezer Playlist Reach : num [1:4600] 17598718 10422430 36321847 24684248 17660624 ...
## $ Amazon Playlist Count : num [1:4600] 114 111 172 210 105 ...
## $ Pandora Streams : num [1:4600] 1.80e+07 7.78e+06 5.02e+06 1.90e+08 4.49e+06 ...
## $ Pandora Track Stations : num [1:4600] 22931 28444 5639 203384 7006 ...
## $ Soundcloud Streams : num [1:4600] 4818457 6623075 7208651 14847968 207179 ...
## $ Shazam Counts : num [1:4600] 2669262 1118279 5285340 11822942 457017 ...
## $ Explicit Track : num [1:4600] 0 1 0 0 1 1 0 1 1 1 ...
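Interpretation: Proper data types are essential for accurate analysis. However, none of the column names listed in this step (for example “artist” or “streams”) exist in the dataset, because its columns are named “Artist”, “Spotify Streams”, and so on, and name matching is case-sensitive. No conversion was applied, which is why the structure is unchanged; “Release Date” is still stored as character.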
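Column-name matching in R is case-sensitive, and the column list printed earlier uses names such as `Artist` and `Explicit Track`. The sketch below uses a small made-up data frame (`toy` is illustrative, and the "No"/"Yes" labels are an assumption about how the 0/1 flag should read) to show the name check and two conversions that would apply to this dataset:

```r
# Toy data with the same column names as the real dataset; values are illustrative
toy <- data.frame(
  Artist = c("Tommy Richman", "Kendrick Lamar", "Kendrick Lamar"),
  `Explicit Track` = c(0, 1, 1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Lower-case "artist" does not match the column "Artist"
has_lower <- "artist" %in% colnames(toy)
has_exact <- "Artist" %in% colnames(toy)

# Convert using the exact column names
toy$Artist <- as.factor(toy$Artist)
toy$`Explicit Track` <- factor(toy$`Explicit Track`,
                               levels = c(0, 1),
                               labels = c("No", "Yes"))

print(c(lower = has_lower, exact = has_exact))
str(toy)
```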
# 2. Data Extraction & Filtering
#(a) How many songs have more than 100 million streams?
# Filter songs with more than 100 million streams
high_stream_songs <- data %>% filter(`Spotify Streams` > 100000000)
# Count the number of such songs
num_high_stream_songs <- nrow(high_stream_songs)
# Print the result
print(paste("Number of songs with more than 100 million streams:", num_high_stream_songs))
## [1] "Number of songs with more than 100 million streams: 3226"
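Interpretation: 3,226 of the 4,600 tracks (about 70%) exceed 100 million Spotify streams, indicating strong listener engagement across most of the dataset. Many of these are older releases rather than 2024 hits, and any track whose missing stream count was filled with the column mean (about 447 million) is also counted, so the figure is somewhat inflated.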
#(b) Which artist has the most songs in the top 100 streamed songs?
# Select the top 100 streamed songs
top_100_songs <- data %>% arrange(desc(`Spotify Streams`)) %>% head(100)
# Count the number of songs per artist
artist_song_count <- top_100_songs %>% count(Artist, sort = TRUE)
# Print the artist with the most songs
top_artist <- artist_song_count %>% slice(1)
print(top_artist)
## # A tibble: 1 × 2
## Artist n
## <chr> <int>
## 1 Bruno Mars 4
Interpretation: Bruno Mars has four tracks among the 100 most-streamed songs on Spotify, the highest count in this dataset, which points to several of his tracks drawing very high stream totals rather than a single hit.
#(c) What percentage of songs belong to the top 5 record labels?
# Count the number of songs per record label
label_song_count <- data %>% count('Record Label', sort = TRUE)
# Get the top 5 record labels
top_5_labels <- label_song_count %>% head(5)
# Calculate the percentage of songs from these labels
total_songs <- nrow(data)
top_5_percentage <- sum(top_5_labels$n) / total_songs * 100
# Print the result
print(paste("Percentage of songs from top 5 record labels:", round(top_5_percentage, 2), "%"))
## [1] "Percentage of songs from top 5 record labels: 100 %"
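Interpretation: The 100% figure is an artifact, not a finding. 'Record Label' is passed to count() as a quoted string, so it is treated as a constant and every row falls into a single group; in addition, the dataset has no record label column. This question cannot be answered with the available columns.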
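In dplyr, a quoted string passed to `count()` is evaluated as a constant rather than as a column name, while backticks refer to the column itself. The sketch below illustrates the difference on a made-up tibble; the `Record Label` column and its values are hypothetical, since the real dataset has no such column:

```r
suppressPackageStartupMessages(library(dplyr))

# Toy data; the `Record Label` column and its values are hypothetical
toy <- tibble(
  Track = c("A", "B", "C", "D"),
  `Record Label` = c("Label 1", "Label 1", "Label 2", "Label 3")
)

# A quoted string is a constant: every row falls into one group
as_string <- toy %>% count("Record Label", sort = TRUE)

# Backticks refer to the column: one row per distinct label
as_column <- toy %>% count(`Record Label`, sort = TRUE)

print(as_string)
print(as_column)

# Share of tracks held by the top 2 labels in the toy data
top_share <- sum(head(as_column$n, 2)) / nrow(toy) * 100
print(top_share)
```

With the string version, the "top labels" always account for every row, which is how a 100% result can appear regardless of the data.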
#(d) How do Spotify Streams compare to YouTube Views? (Are songs with high Spotify streams also popular on YouTube?)
# Compute correlation between Spotify Streams and YouTube Views
correlation <- cor(data$`Spotify Streams`, data$`YouTube Views`, use = "complete.obs")
print(paste("Correlation between Spotify Streams and YouTube Views:", round(correlation, 2)))
## [1] "Correlation between Spotify Streams and YouTube Views: 0.43"
Interpretation: The correlation of 0.43 is a moderate positive relationship: songs with high Spotify streams tend to have more YouTube views, but much of the variation is platform-specific.
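Streaming counts are usually strongly right-skewed, and Pearson correlation on raw counts can be driven by a few very large tracks. As a hedged robustness check, a rank-based (Spearman) correlation can be reported alongside it. The sketch below uses simulated data, not the real dataset; the distribution parameters are arbitrary assumptions:

```r
set.seed(1)
# Simulated right-skewed counts; not the real streaming data
spotify <- rlnorm(200, meanlog = 18, sdlog = 1)
youtube <- spotify^0.5 * rlnorm(200, meanlog = 9, sdlog = 1)

# Pearson measures linear association on the raw scale
pearson_raw <- cor(spotify, youtube, use = "complete.obs")

# Spearman works on ranks, so it is unaffected by monotone transforms such as log()
spearman_raw <- cor(spotify, youtube, method = "spearman", use = "complete.obs")
spearman_log <- cor(log(spotify), log(youtube), method = "spearman")

print(c(pearson = pearson_raw, spearman = spearman_raw))
```

If the two coefficients differ substantially on the real data, that suggests the Pearson value is influenced by outliers or a non-linear relationship.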
# 3. Grouping & Summarisation
#(a)What is the average number of streams per song?
# Calculate the average number of streams per song
avg_streams <- mean(data$`Spotify Streams`, na.rm = TRUE)
# Print the result
print(paste("Average number of streams per song:", round(avg_streams, 2)))
## [1] "Average number of streams per song: 447387314.75"
Interpretation: This average provides a benchmark for evaluating whether a track is outperforming the norm.
#(b).Which platform contributes the most to the overall track popularity?
# Sum engagement by platform
platform_totals <- data.frame(
  Platform = c("Spotify", "YouTube", "TikTok", "Soundcloud", "Pandora"),
  Total_Engagement = c(
    sum(data$`Spotify Streams`, na.rm = TRUE),
    sum(data$`YouTube Views`, na.rm = TRUE),
    sum(data$`TikTok Views`, na.rm = TRUE),
    sum(data$`Soundcloud Streams`, na.rm = TRUE),
    sum(data$`Pandora Streams`, na.rm = TRUE)
  )
)
# Sort by total engagement
platform_totals <- platform_totals %>% arrange(desc(Total_Engagement))
print(platform_totals)
## Platform Total_Engagement
## 1 TikTok 5.341326e+12
## 2 Spotify 2.057982e+12
## 3 YouTube 1.852865e+12
## 4 Pandora 3.940698e+11
## 5 Soundcloud 6.830065e+10
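Interpretation: TikTok has the highest total (about 5.3 trillion views), followed by Spotify (about 2.1 trillion streams) and YouTube (about 1.9 trillion views). These totals are not directly comparable, because a short TikTok view and a full Spotify stream measure different things, so the ranking shows volume rather than influence.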
#(c)Which artist has the highest combined social media reach (YouTube + TikTok + Spotify)?
# Calculate total social media reach
data <- data %>%
  mutate(Social_Reach = `YouTube Views` + `TikTok Views` + `Spotify Playlist Reach`)
# Group by artist and sum their total reach
artist_reach <- data %>%
  group_by(Artist) %>%
  summarise(Total_Reach = sum(Social_Reach, na.rm = TRUE)) %>%
  arrange(desc(Total_Reach))
# Display the artist with the highest total social reach
top_social_artist <- artist_reach %>% slice(1)
print(top_social_artist)
## # A tibble: 1 × 2
## Artist Total_Reach
## <chr> <dbl>
## 1 Kevin MacLeod 233245259334
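Interpretation: Kevin MacLeod has the highest combined reach (about 233 billion). Because Social_Reach adds raw counts, this total is likely dominated by TikTok views, the largest of the three metrics in total volume, rather than by Spotify or YouTube activity.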
#(d) What is the total number of streams per artist?
# Summarize total streams by artist
artist_streams <- data %>% group_by(Artist) %>% summarize(total_streams = sum(`Spotify Streams`, na.rm = TRUE))
# Print the result
print(artist_streams)
## # A tibble: 2,000 × 2
## Artist total_streams
## <chr> <dbl>
## 1 "\"XY\"" 447387315.
## 2 "$OHO BANI" 54065563
## 3 "$uicideboy$" 1697447430
## 4 "&ME" 34601626
## 5 "(G)I-DLE" 876938452
## 6 "*NSYNC" 69041864
## 7 ".diedlonely" 44235866
## 8 "10CM" 18800716
## 9 "13 Organis\xef\xbf" 267241905
## 10 "1da Banton" 154010050
## # ℹ 1,990 more rows
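Interpretation: The table is sorted alphabetically by artist, so it lists totals rather than a ranking; sorting by total_streams in descending order would show the leading artists. A high total can come from many tracks or from one very large hit. A total of 447,387,315 (the first row) matches the column mean, which suggests the original stream count was missing and was filled in during cleaning.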
# 4. Sorting and Ranking Data
#(a)Which songs stayed in the top 10 "All Time Rank" the longest?
# Count occurrences of songs in the top 10 of "All Time Rank"
top_10_songs <- data %>%
  filter(`All Time Rank` <= 10) %>%
  group_by(Track) %>%
  summarise(Top10_Appearances = n()) %>%
  arrange(desc(Top10_Appearances))
# Print the result
print(top_10_songs)
## # A tibble: 10 × 2
## Track Top10_Appearances
## <chr> <int>
## 1 BAND4BAND (feat. Lil Baby) 1
## 2 Beautiful Things 1
## 3 Danza Kuduro - Cover 1
## 4 Flowers 1
## 5 Gata Only 1
## 6 Houdini 1
## 7 Lovin On Me 1
## 8 MILLION DOLLAR BABY 1
## 9 Not Like Us 1
## 10 i like the way you kiss me 1
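Interpretation: Every track appears exactly once because “All Time Rank” is a single snapshot, one rank per track, with no rank history. The result is therefore just the ten tracks ranked 1 to 10; the dataset cannot show how long a song stayed in the top 10.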
#(b)What is the correlation between YouTube Views and Spotify Streams?
# Calculate correlation between YouTube Views and Spotify Streams
correlation <- cor(data$`YouTube Views`, data$`Spotify Streams`, use = "complete.obs")
# Print correlation result
print(paste("Correlation between YouTube Views and Spotify Streams:", round(correlation, 2)))
## [1] "Correlation between YouTube Views and Spotify Streams: 0.43"
Interpretation: This repeats the result from section 2(d): the correlation is 0.43, a moderate positive relationship. Performance on one platform gives only a rough indication of performance on the other.
#(c)Which song had the most consistent ranking across multiple platforms?
# Select ranking-related columns (adjust column names based on dataset)
ranking_columns <- c("Spotify Popularity", "YouTube Views", "TikTok Likes", "Apple Music Playlist Count")
# Calculate variance in rankings for each song
ranking_variability <- data %>%
  rowwise() %>%
  mutate(Rank_Variance = var(c_across(all_of(ranking_columns)), na.rm = TRUE)) %>%
  select(Track, Rank_Variance) %>%
  arrange(Rank_Variance)
# Print the song with the most consistent ranking (lowest variance)
print(ranking_variability %>% head(10))
## # A tibble: 10 × 2
## # Rowwise:
## Track Rank_Variance
## <chr> <dbl>
## 1 "Straight and Narrow" 5803046608.
## 2 "She's Gone, Da" 7320406633.
## 3 "Talk talk" 9819019840.
## 4 "Magic Johnson" 13244746812.
## 5 "BBY BOO (REMIX)" 101036019237.
## 6 "I'm A Star - From \"Wish\"" 105663385861.
## 7 "Atmosphere" 133259941130
## 8 "High Road (feat. Jessie Murph)" 136332811523.
## 9 "Solid (feat. Drake)" 217338537016
## 10 "Give It To Me - Full Vocal Mix" 229144336433.
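Interpretation: The selected columns are raw counts and scores on very different scales, not rankings, so the variance is dominated by the largest metrics (YouTube Views and TikTok Likes). Low variance here mostly identifies tracks with small raw numbers rather than tracks that perform equally well across platforms. Comparing percentile ranks per platform would measure consistency more directly.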
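Variance across raw metrics mixes very different scales, so one alternative is to convert each metric to a within-platform percentile rank first and then measure the spread of those ranks per track. This is a sketch of that idea, not the method used above; the data frame `toy`, its column names, and its values are made up for illustration:

```r
# Toy data: three platforms on very different scales (values are illustrative)
toy <- data.frame(
  Track   = c("A", "B", "C", "D"),
  Spotify = c(100, 200, 300, 400),
  YouTube = c(10, 20, 30, 40),
  TikTok  = c(4000, 1000, 3000, 2000)
)
metric_cols <- c("Spotify", "YouTube", "TikTok")

# Convert each metric to a percentile rank in (0, 1] so the scales are comparable
pct <- sapply(toy[metric_cols],
              function(x) rank(x, na.last = "keep") / sum(!is.na(x)))

# Spread of a track's percentile ranks across platforms: lower means more consistent
toy$Rank_SD <- apply(pct, 1, sd, na.rm = TRUE)

# Most consistent tracks first
consistent <- toy[order(toy$Rank_SD), c("Track", "Rank_SD")]
print(consistent)
```

In the toy data, track "C" sits at the same percentile on every platform, so it comes out as the most consistent even though its raw numbers differ by orders of magnitude.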
#(d)Which songs were trending on both TikTok and Spotify simultaneously?
# Define the threshold for top 10% in both TikTok and Spotify
spotify_threshold <- quantile(data$`Spotify Streams`, 0.90, na.rm = TRUE)
tiktok_threshold <- quantile(data$`TikTok Posts`, 0.90, na.rm = TRUE)
# Filter songs that meet both conditions
trending_songs <- data %>%
  filter(`Spotify Streams` >= spotify_threshold & `TikTok Posts` >= tiktok_threshold) %>%
  select(Track, Artist, `Spotify Streams`, `TikTok Posts`)
# Print the result
print(trending_songs)
## # A tibble: 79 × 4
## Track Artist `Spotify Streams` `TikTok Posts`
## <chr> <chr> <dbl> <dbl>
## 1 Flowers Miley Cyrus 2031280633 7189811
## 2 greedy Tate McRae 1258569694 2294429
## 3 As It Was Harry Styles 3301814535 2755903
## 4 STAY (with Justin Bieber) The Kid LAROI 3107100349 7485966
## 5 Dance Monkey Tones And I 3071214106 10342366
## 6 Shape of You Ed Sheeran 3909458734 2270315
## 7 Blinding Lights The Weeknd 4281468720 2882064
## 8 Unholy (feat. Kim Petras) Sam Smith 1556275789 2379787
## 9 Me Porto Bonito Bad Bunny 1811990630 4506600
## 10 Perfect Ed Sheeran 2969999682 6642975
## # ℹ 69 more rows
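Interpretation: These 79 tracks are in the top 10% for both Spotify streams and TikTok posts, so they are strong on both user-generated content (TikTok) and listening (Spotify). Because the metrics are cumulative totals, this shows overlap in all-time popularity rather than simultaneous trending.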
# 5. Feature Engineering
# (a). Composite Popularity Score (standardized sum of key metrics)
data <- data %>%
  mutate(Composite_Popularity_Score = scale(`Spotify Streams`) +
           scale(`YouTube Views`) +
           scale(`TikTok Views`) +
           scale(`Spotify Playlist Count`) +
           scale(`Spotify Popularity`) +
           scale(`Shazam Counts`))
# Print summary of new composite score
print(summary(data$Composite_Popularity_Score))
## V1
## Min. :-7.0735
## 1st Qu.:-2.1585
## Median :-0.9755
## Mean : 0.0000
## 3rd Qu.: 1.2137
## Max. :40.2469
Interpretation: A comprehensive score helps rank tracks holistically across platforms, smoothing out biases from individual metrics.
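The `V1` label in the summary output appears because `scale()` returns a one-column matrix, so the composite score is stored as a matrix column rather than a plain numeric vector. The sketch below, using two made-up vectors, shows the difference and one way to obtain a plain vector with `as.numeric()`:

```r
# Two made-up metrics on different scales
x <- c(10, 20, 30, 40)
y <- c(1, 4, 2, 3)

# scale() returns a one-column matrix carrying centering/scaling attributes
z_matrix <- scale(x) + scale(y)

# as.numeric() drops the matrix shape and attributes, leaving a plain vector
z_vector <- as.numeric(scale(x)) + as.numeric(scale(y))

print(is.matrix(z_matrix))
print(is.matrix(z_vector))
print(summary(z_vector))
```

The values are the same either way; the plain vector simply behaves more predictably in later steps such as sorting, plotting, or exporting. Note also that a sum of z-scores weights every metric equally, which is a modelling choice rather than a property of the data.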
# (b) Engagement Ratios
data <- data %>%
  mutate(
    Views_per_Playlist = ifelse(`YouTube Playlist Reach` > 0,
                                `YouTube Views` / `YouTube Playlist Reach`, NA),
    TikTok_Engagement_Rate = ifelse(`TikTok Posts` > 0,
                                    `TikTok Likes` / `TikTok Posts`, NA),
    Spotify_Reach_per_Stream = ifelse(`Spotify Streams` > 0,
                                      `Spotify Playlist Reach` / `Spotify Streams`, NA)
  )
# Print example rows for engagement ratios
print("First 5 rows of Engagement Ratios:")
## [1] "First 5 rows of Engagement Ratios:"
print(data %>% select(Views_per_Playlist, TikTok_Engagement_Rate, Spotify_Reach_per_Stream) %>% head(5))
## # A tibble: 5 × 3
## Views_per_Playlist TikTok_Engagement_Rate Spotify_Reach_per_Stream
## <dbl> <dbl> <dbl>
## 1 0.560 113. 0.504
## 2 0.744 52.2 0.539
## 3 0.328 90.9 0.352
## 4 0.327 150. 0.0672
## 5 0.686 6868. 1.42
Interpretation: Ratios such as Views per Playlist (YouTube Views divided by YouTube Playlist Reach) and TikTok Engagement Rate (TikTok Likes per TikTok Post) measure audience interaction efficiency rather than just raw numbers.
# (c). Cross-Platform Popularity
data <- data %>%
  mutate(Cross_Platform_Popularity = `Spotify Streams` + `YouTube Views` + `TikTok Views` + `Pandora Streams` + `Soundcloud Streams`)
# Print top songs by Cross-Platform Popularity
top_cross_platform_songs <- data %>%
  arrange(desc(Cross_Platform_Popularity)) %>%
  select(Track, Artist, Cross_Platform_Popularity)
print("Top songs by Cross-Platform Popularity:")
## [1] "Top songs by Cross-Platform Popularity:"
print(head(top_cross_platform_songs, 10))
## # A tibble: 10 × 3
## Track Artist Cross_Platform_Popularity
## <chr> <chr> <dbl>
## 1 Monkeys Spinning Monkeys Kevin MacLeod 233355761424.
## 2 Love You So The King Khan & BBQ Show 214882841953.
## 3 Oh No Kreepa 61162058226.
## 4 Funny Song Cavendish Music 38406176345.
## 5 Aesthetic Tollan Kim 33894494044.
## 6 Spongebob Dante9k 33777959787.
## 7 She Share Story Shayne Orok 33664390781.
## 8 STAY (with Justin Bieber) The Kid LAROI 28309576032
## 9 Pieces Danilo Stankovic 28138961047.
## 10 love nwantiti (ah ah ah) CKay 25965134049.
# Final structure of the dataset
print("Updated structure of dataset:")
## [1] "Updated structure of dataset:"
str(data)
## tibble [4,600 × 34] (S3: tbl_df/tbl/data.frame)
## $ Track : chr [1:4600] "MILLION DOLLAR BABY" "Not Like Us" "i like the way you kiss me" "Flowers" ...
## $ Album Name : chr [1:4600] "Million Dollar Baby - Single" "Not Like Us" "I like the way you kiss me" "Flowers - Single" ...
## $ Artist : chr [1:4600] "Tommy Richman" "Kendrick Lamar" "Artemas" "Miley Cyrus" ...
## $ Release Date : chr [1:4600] "4/26/2024" "5/4/2024" "3/19/2024" "1/12/2023" ...
## $ ISRC : chr [1:4600] "QM24S2402528" "USUG12400910" "QZJ842400387" "USSM12209777" ...
## $ All Time Rank : num [1:4600] 1 2 3 4 5 6 7 8 9 10 ...
## $ Track Score : num [1:4600] 725 546 538 445 423 ...
## $ Spotify Streams : num [1:4600] 3.90e+08 3.24e+08 6.01e+08 2.03e+09 1.07e+08 ...
## $ Spotify Playlist Count : num [1:4600] 30716 28113 54331 269802 7223 ...
## $ Spotify Playlist Reach : num [1:4600] 1.97e+08 1.75e+08 2.12e+08 1.37e+08 1.51e+08 ...
## $ Spotify Popularity : num [1:4600] 92 92 92 85 88 ...
## $ YouTube Views : num [1:4600] 8.43e+07 1.16e+08 1.23e+08 1.10e+09 7.74e+07 ...
## $ YouTube Likes : num [1:4600] 1713126 3486739 2228730 10629796 3670188 ...
## $ TikTok Posts : num [1:4600] 5767700 674700 3025400 7189811 16400 ...
## $ TikTok Likes : num [1:4600] 6.52e+08 3.52e+07 2.75e+08 1.08e+09 1.13e+08 ...
## $ TikTok Views : num [1:4600] 5.33e+09 2.08e+08 3.37e+09 1.46e+10 1.16e+09 ...
## $ YouTube Playlist Reach : num [1:4600] 1.51e+08 1.56e+08 3.74e+08 3.35e+09 1.13e+08 ...
## $ Apple Music Playlist Count: num [1:4600] 210 188 190 394 182 ...
## $ AirPlay Spins : num [1:4600] 40975 40778 74333 1474799 12185 ...
## $ SiriusXM Spins : num [1:4600] 684 3 536 2182 1 ...
## $ Deezer Playlist Count : num [1:4600] 62 67 136 264 82 ...
## $ Deezer Playlist Reach : num [1:4600] 17598718 10422430 36321847 24684248 17660624 ...
## $ Amazon Playlist Count : num [1:4600] 114 111 172 210 105 ...
## $ Pandora Streams : num [1:4600] 1.80e+07 7.78e+06 5.02e+06 1.90e+08 4.49e+06 ...
## $ Pandora Track Stations : num [1:4600] 22931 28444 5639 203384 7006 ...
## $ Soundcloud Streams : num [1:4600] 4818457 6623075 7208651 14847968 207179 ...
## $ Shazam Counts : num [1:4600] 2669262 1118279 5285340 11822942 457017 ...
## $ Explicit Track : num [1:4600] 0 1 0 0 1 1 0 1 1 1 ...
## $ Social_Reach : num [1:4600] 5.61e+09 4.99e+08 3.70e+09 1.58e+10 1.39e+09 ...
## $ Composite_Popularity_Score: num [1:4600, 1] 1.78 0.408 2.654 12.667 -0.56 ...
## ..- attr(*, "scaled:center")= num 4.47e+08
## ..- attr(*, "scaled:scale")= num 5.32e+08
## $ Views_per_Playlist : num [1:4600] 0.56 0.744 0.328 0.327 0.686 ...
## $ TikTok_Engagement_Rate : num [1:4600] 113 52.2 90.9 150 6868.1 ...
## $ Spotify_Reach_per_Stream : num [1:4600] 0.5036 0.5394 0.3519 0.0672 1.4151 ...
## $ Cross_Platform_Popularity : num [1:4600] 5.83e+09 6.63e+08 4.11e+09 1.79e+10 1.35e+09 ...
Interpretation: The top-10 table lists the tracks with the highest combined totals across Spotify, YouTube, TikTok, Pandora, and SoundCloud. Because the score is an unweighted sum of raw counts, TikTok Views, the largest metric by total volume, dominates it, so the list mainly reflects TikTok reach.
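When raw counts from several platforms are added together, the platform with the largest numbers tends to decide the ranking. One possible alternative, sketched here on made-up data, is to express each track's count as a share of that platform's total and average the shares, so every platform carries equal weight. The names `toy`, `Raw_Sum`, and `Mean_Share` are illustrative:

```r
# Toy data: TikTok counts are an order of magnitude larger than the others
toy <- data.frame(
  Track   = c("A", "B", "C"),
  Spotify = c(300, 200, 100),
  YouTube = c(30, 20, 10),
  TikTok  = c(2000, 1000, 6000)
)
platforms <- c("Spotify", "YouTube", "TikTok")

# Raw sum: the platform with the largest numbers decides the ranking
toy$Raw_Sum <- rowSums(toy[platforms], na.rm = TRUE)

# Share of each platform's total, averaged so every platform weighs equally
shares <- sapply(toy[platforms], function(x) x / sum(x, na.rm = TRUE))
toy$Mean_Share <- rowMeans(shares, na.rm = TRUE)

print(toy[order(-toy$Mean_Share), c("Track", "Raw_Sum", "Mean_Share")])
```

In this toy example the raw sum ranks track "C" first because of its TikTok count, while the share-based score ranks track "A" first because it leads on two of the three platforms.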
10 data analysis questions
(a) Does the “Explicit Track” flag have a significant effect on “YouTube Views”?
Hypotheses:
Null Hypothesis (H₀): There is no significant difference in YouTube Views between Explicit and Non-Explicit tracks.
Alternative Hypothesis (H₁): There is a significant difference in YouTube Views between Explicit and Non-Explicit tracks.
# Ensure "Explicit Track" is a factor (categorical variable)
data$`Explicit Track` <- as.factor(data$`Explicit Track`)
# Perform ANOVA
anova_result <- aov(`YouTube Views` ~ `Explicit Track`, data = data)
# Display the summary of the ANOVA result
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Explicit Track` 1 4.600e+19 4.600e+19 102.3 <2e-16 ***
## Residuals 4598 2.068e+21 4.498e+17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# box plot to visualize YouTube Views by Explicit Track category
ggplot(data, aes(x = `Explicit Track`, y = `YouTube Views`, fill = `Explicit Track`)) +
  geom_boxplot() +
  labs(title = "YouTube Views by Explicit Track",
       x = "Explicit Track",
       y = "YouTube Views") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "salmon")) # Customize colors
Interpretation: The p-value is very small (F = 102.3, p < 2e-16), so we reject H₀: mean YouTube Views differ significantly between explicit and non-explicit tracks. This is an association, not proof that the explicit flag itself changes the number of views.
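With only two groups, a one-way ANOVA is equivalent to a pooled-variance t-test (the F statistic equals the squared t statistic). Because view counts are heavily skewed, a Welch t-test on log-transformed views is a reasonable cross-check. The sketch below uses simulated data with arbitrary parameters, not the real dataset:

```r
set.seed(42)
# Simulated data; group sizes, means and spreads are arbitrary assumptions
toy <- data.frame(
  explicit = factor(rep(c(0, 1), times = c(60, 40))),
  views    = c(rlnorm(60, meanlog = 17, sdlog = 1),
               rlnorm(40, meanlog = 17.5, sdlog = 1))
)

# One-way ANOVA with a two-level factor
f_value <- summary(aov(views ~ explicit, data = toy))[[1]][["F value"]][1]

# Pooled-variance t-test on the same data: F equals t squared
t_pooled <- t.test(views ~ explicit, data = toy, var.equal = TRUE)

# Welch t-test on log views: no equal-variance assumption, less sensitive to skew
t_welch_log <- t.test(log(views) ~ explicit, data = toy)

print(c(F = f_value, t_squared = unname(t_pooled$statistic)^2))
print(t_welch_log$p.value)
```

If the ANOVA and the log-scale Welch test agree on the real data, the conclusion is less likely to depend on a handful of extreme view counts.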
(b) What is the relationship between “Spotify Streams” and “YouTube Views”?
Hypothesis: There is a linear relationship between the number of Spotify streams and YouTube views.
# Perform linear regression
linear_model <- lm(`YouTube Views` ~ `Spotify Streams`, data = data)
# Summary of the linear model
summary(linear_model)
##
## Call:
## lm(formula = `YouTube Views` ~ `Spotify Streams`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.112e+09 -2.228e+08 -1.466e+08 6.210e+07 1.577e+10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.551e+08 1.177e+07 13.18 <2e-16 ***
## `Spotify Streams` 5.537e-01 1.694e-02 32.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 610800000 on 4598 degrees of freedom
## Multiple R-squared: 0.1886, Adjusted R-squared: 0.1884
## F-statistic: 1069 on 1 and 4598 DF, p-value: < 2.2e-16
# Create a scatter plot with the regression line
ggplot(data, aes(x = `Spotify Streams`, y = `YouTube Views`)) +
  geom_point(color = "blue") + # Scatter plot points
  geom_smooth(method = "lm", col = "red") + # Add regression line
  labs(title = "Linear Relationship between Spotify Streams and YouTube Views",
       x = "Spotify Streams",
       y = "YouTube Views") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: Spotify Streams are a statistically significant predictor of YouTube Views (p < 2e-16), but the fit is weak: R-squared is 0.1886, so the model explains about 19% of the variation and other factors account for most of the rest.
(c) What is the correlation between “TikTok Views” and “Spotify Streams”?
# Calculate correlation
correlation_tiktok_streams <- cor(data$`TikTok Views`, data$`Spotify Streams`, use = "complete.obs")
print(paste("Correlation between TikTok Views and Spotify Streams:", round(correlation_tiktok_streams, 2)))
## [1] "Correlation between TikTok Views and Spotify Streams: 0.03"
# Scatter plot to visualize the relationship
ggplot(data, aes(x = `TikTok Views`, y = `Spotify Streams`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Correlation between TikTok Views and Spotify Streams",
       x = "TikTok Views",
       y = "Spotify Streams")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: The correlation coefficient between TikTok Views and Spotify Streams is 0.03, a very weak positive relationship: TikTok view counts have almost no linear association with Spotify streaming numbers.
(d) What is the correlation between “TikTok Posts” and “TikTok Likes”?
# Calculate correlation
correlation_tiktok_posts_likes <- cor(data$`TikTok Posts`, data$`TikTok Likes`, use = "complete.obs")
print(paste("Correlation between TikTok Posts and TikTok Likes:", round(correlation_tiktok_posts_likes, 2)))
## [1] "Correlation between TikTok Posts and TikTok Likes: 0.5"
# Scatter plot to visualize the relationship
ggplot(data, aes(x = `TikTok Posts`, y = `TikTok Likes`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Correlation between TikTok Posts and TikTok Likes",
       x = "TikTok Posts",
       y = "TikTok Likes")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: The correlation between TikTok Posts and TikTok Likes is 0.5, a moderate positive relationship: more posts are generally associated with more likes, but the relationship is not very strong.
(e) Do “YouTube Views” predict “Spotify Streams”?
# Linear regression model
model_youtube_streams <- lm(`Spotify Streams` ~ `YouTube Views`, data = data)
summary(model_youtube_streams)
##
## Call:
## lm(formula = `Spotify Streams` ~ `YouTube Views`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.149e+09 -2.762e+08 -1.467e+08 1.466e+08 3.814e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.102e+08 8.216e+06 37.75 <2e-16 ***
## `YouTube Views` 3.406e-01 1.042e-02 32.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 479100000 on 4598 degrees of freedom
## Multiple R-squared: 0.1886, Adjusted R-squared: 0.1884
## F-statistic: 1069 on 1 and 4598 DF, p-value: < 2.2e-16
# Scatter plot with regression line
ggplot(data, aes(x = `YouTube Views`, y = `Spotify Streams`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "YouTube Views Predicting Spotify Streams",
       x = "YouTube Views",
       y = "Spotify Streams")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: YouTube Views are a significant positive predictor of Spotify Streams (p < 0.001). Each additional YouTube view is associated with about 0.34 additional Spotify streams on average. However, YouTube Views explain only 18.86% of the variation in Spotify Streams, so other factors also influence Spotify success. The model is statistically significant but has a weak-to-moderate fit.
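The figures in the regression summary can be cross-checked against each other. In a simple linear regression with one predictor, the R-squared equals the squared Pearson correlation, and the F-statistic equals the squared t-value of the slope. The short sketch below applies both identities to the values reported above; the variable names are illustrative only.

```r
# Values copied from the summary(model_youtube_streams) output
r_squared <- 0.1886
t_slope   <- 32.69
f_stat    <- 1069

# With one predictor, |r| = sqrt(R^2); the slope is positive, so r is positive
r_implied <- sqrt(r_squared)
print(round(r_implied, 2))   # 0.43

# With one predictor, F equals the squared slope t-value (up to rounding)
print(round(t_slope^2))      # 1069
```

An implied correlation of roughly 0.43 is consistent with the "weak-to-moderate" reading of this model.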
# Linear regression model
model_playlist_streams <- lm(`Spotify Streams` ~ `Spotify Playlist Count`, data = data)
summary(model_playlist_streams)
##
## Call:
## lm(formula = `Spotify Streams` ~ `Spotify Playlist Count`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.052e+09 -1.171e+08 -7.185e+07 4.792e+07 3.814e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.055e+07 6.183e+06 14.64 <2e-16 ***
## `Spotify Playlist Count` 6.008e+03 6.703e+01 89.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 320900000 on 4598 degrees of freedom
## Multiple R-squared: 0.636, Adjusted R-squared: 0.636
## F-statistic: 8035 on 1 and 4598 DF, p-value: < 2.2e-16
ggplot(data, aes(x = `Spotify Playlist Count`, y = `Spotify Streams`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Effect of Spotify Playlist Count on Spotify Streams",
       x = "Spotify Playlist Count",
       y = "Spotify Streams")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: The regression analysis shows that Spotify Playlist Count has a strong positive relationship with Spotify Streams. Each additional playlist is associated with approximately 6,008 additional Spotify streams on average (p < 0.001). The R-squared value of 0.636 indicates that playlist count explains about 64% of the variation in Spotify Streams, a much better fit than the YouTube Views model (R-squared of 0.1886).
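The coefficients in the `summary(model_playlist_streams)` output can be turned into a point prediction by hand. The sketch below copies the reported intercept and slope; `predict_streams()` is a hypothetical helper introduced only for illustration, and the input of 10,000 playlists is an arbitrary example value.

```r
# Coefficients copied from the summary(model_playlist_streams) output
intercept <- 9.055e+07
slope     <- 6.008e+03

# Hypothetical helper (not part of the original analysis):
# fitted Spotify Streams for a given Spotify Playlist Count
predict_streams <- function(playlist_count) {
  intercept + slope * playlist_count
}

# Example: a track that appears on 10,000 playlists (illustrative input)
predicted <- predict_streams(10000)
print(predicted)   # about 1.5063e+08 streams
```

The residual standard error of the model is about 3.2e+08, larger than this fitted value, so individual predictions carry wide uncertainty even though the overall fit is good.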
g. How does the track score influence Spotify popularity?
model_score_popularity <- lm(`Spotify Popularity` ~ `Track Score`, data = data)
summary(model_score_popularity)
##
## Call:
## lm(formula = `Spotify Popularity` ~ `Track Score`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.455 -1.170 1.608 7.750 25.909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.650527 0.314895 192.60 <2e-16 ***
## `Track Score` 0.068135 0.005535 12.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.47 on 4598 degrees of freedom
## Multiple R-squared: 0.0319, Adjusted R-squared: 0.03169
## F-statistic: 151.5 on 1 and 4598 DF, p-value: < 2.2e-16
# Scatter plot with regression line
ggplot(data, aes(x = `Track Score`, y = `Spotify Popularity`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Track Score Influencing Spotify Popularity",
       x = "Track Score",
       y = "Spotify Popularity")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: The regression analysis shows a positive relationship between Track Score and Spotify Popularity. For each one-unit increase in Track Score, Spotify Popularity increases by approximately 0.068 on average. The model is statistically significant, but the R-squared value of 0.0319 indicates that Track Score explains only about 3.2% of the variation in Spotify Popularity, so other factors account for most of it.
h. How do Spotify Playlist Counts vary by Artist?
# Total playlist count per artist, keeping the ten largest
top_artists <- data %>%
  group_by(Artist) %>%
  summarise(Total_Playlist_Count = sum(`Spotify Playlist Count`, na.rm = TRUE)) %>%
  arrange(desc(Total_Playlist_Count)) %>%
  slice(1:10)
# Horizontal bar chart of the top 10 artists
ggplot(top_artists, aes(x = reorder(Artist, -Total_Playlist_Count), y = Total_Playlist_Count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 10 Artists by Spotify Playlist Count", x = "Artist", y = "Total Playlist Count") +
  coord_flip()
Interpretation: The plot displays the ten artists with the highest
total playlist counts on Spotify. Because of coord_flip(), the artists
appear on the y-axis and their total playlist counts on the x-axis, so
the horizontal bars make it easy to compare artists directly.
i. How are Track Scores distributed across all songs?
# Plot histogram for Track Score
ggplot(data, aes(x = `Track Score`)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Track Scores",
       x = "Track Score",
       y = "Number of Tracks") +
  theme_minimal()
Interpretation: The histogram shows the distribution of track scores grouped into intervals of 10. Each bar represents the number of tracks within a score range, giving an overview of how track scores are spread across the dataset.
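The bar heights of a histogram can also be tabulated directly, which makes it easier to quote exact counts. The sketch below is a minimal illustration on simulated scores: `toy_scores` is a hypothetical stand-in for the `Track Score` column, and the bin edges here start at 0, which may differ from the edges ggplot2 chooses by default.

```r
# Simulated, right-skewed scores (hypothetical stand-in for `Track Score`)
set.seed(1)
toy_scores <- rexp(1000, rate = 1 / 40)

# Bin edges with the same width as the histogram (10 units), starting at 0
upper  <- ceiling(max(toy_scores) / 10) * 10
breaks <- seq(0, upper, by = 10)

# Count the tracks that fall into each interval
bin_counts <- table(cut(toy_scores, breaks = breaks, include.lowest = TRUE))

print(head(bin_counts))      # counts for the lowest score ranges
print(summary(toy_scores))   # median below mean suggests right skew
```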
j. Distribution of total digital presence
# Calculate the total sums for each important platform column
platform_totals <- c(
  "Spotify Streams" = sum(data$`Spotify Streams`, na.rm = TRUE),
  "YouTube Views" = sum(data$`YouTube Views`, na.rm = TRUE),
  "TikTok Views" = sum(data$`TikTok Views`, na.rm = TRUE),
  "AirPlay Spins" = sum(data$`AirPlay Spins`, na.rm = TRUE),
  "SiriusXM Spins" = sum(data$`SiriusXM Spins`, na.rm = TRUE),
  "Deezer Playlist Reach" = sum(data$`Deezer Playlist Reach`, na.rm = TRUE),
  "Amazon Playlist Count" = sum(data$`Amazon Playlist Count`, na.rm = TRUE),
  "Pandora Streams" = sum(data$`Pandora Streams`, na.rm = TRUE),
  "Soundcloud Streams" = sum(data$`Soundcloud Streams`, na.rm = TRUE),
  "Shazam Counts" = sum(data$`Shazam Counts`, na.rm = TRUE)
)
# Convert totals to a data frame for plotting
platform_df <- data.frame(
  Platform = names(platform_totals),
  Total = as.numeric(platform_totals)
)
# Create the pie chart
ggplot(platform_df, aes(x = "", y = Total, fill = Platform)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(
    title = "Distribution of Total Digital Presence",
    fill = "Platform"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10)
  ) +
  scale_fill_brewer(palette = "Paired")
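Interpretation: TikTok Views account for the largest share of the total digital presence. Note that the slices combine different kinds of metrics (views, streams, spins, playlist reach, and counts), so the shares compare raw totals rather than like-for-like engagement.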
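Exact shares are hard to read off a pie chart, so it can help to compute them as percentages alongside the plot. The sketch below shows one way to do this; the platform names and totals are hypothetical placeholders, not values from the dataset, and the same two steps could be applied to the real `platform_df`.

```r
# Hypothetical totals (placeholders, not values from the dataset)
platform_totals <- c("Platform A" = 500, "Platform B" = 300, "Platform C" = 200)

platform_df <- data.frame(
  Platform = names(platform_totals),
  Total    = as.numeric(platform_totals)
)

# Share of the grand total, in percent
platform_df$Share_Pct <- round(100 * platform_df$Total / sum(platform_df$Total), 1)

# Largest share first
platform_df <- platform_df[order(-platform_df$Share_Pct), ]
print(platform_df)
```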
The findings from the Spotify Songs 2024 dataset analysis would be highly valuable for several key groups:
1. Record Labels and Music Producers: To identify successful artists, understand listener behavior across platforms, and develop targeted marketing strategies to maximize song popularity and revenue.
2. Artists and Music Managers: To assess their performance relative to competitors, find opportunities to expand reach across platforms like TikTok and YouTube, and focus on areas that drive engagement and streaming numbers.
3. Marketing and Advertising Agencies: To design promotional campaigns based on insights into where and how audiences engage with music most actively.
4. Streaming Platforms (Spotify, YouTube, TikTok, etc.): To refine recommendation algorithms, develop partnerships with trending artists, and improve user experience by aligning content with listener preferences.
5. Music Analysts and Researchers: To explore industry trends, study platform correlations, and forecast future music consumption patterns.
6. Event Organizers and Concert Promoters: To choose trending artists for events and understand the kind of music that is currently resonating with large audiences.
7. Investors and Talent Scouts: To make data-driven decisions about which emerging artists or songs to invest in for future growth and returns.
Conclusion
The analysis of the Spotify Songs 2024 dataset provided comprehensive insights into streaming trends, artist popularity, and cross-platform engagement patterns. Data preprocessing ensured the reliability and consistency of the dataset by handling missing values, removing duplicates, and performing the necessary data type conversions. Extraction and filtering identified high-streaming songs, dominant artists, and the influence of major record labels. The study found a statistically significant but only weak-to-moderate relationship between Spotify streams and YouTube views (R-squared of about 0.19), indicating that the platforms are connected but far from interchangeable.
Grouping and summarization techniques offered a deeper understanding of platform popularity and social media reach, helping to identify artists with substantial cross-platform influence. Sorting and ranking processes uncovered songs with enduring success in top rankings and emphasized consistency in performance across multiple platforms. Feature engineering enriched the dataset by creating new variables like Composite Popularity Scores and engagement ratios, allowing a more nuanced measurement of song and artist performance.
Statistical analyses, including ANOVA and regression modeling, provided evidence of significant differences in genre-wise streams and established predictive relationships between artist popularity and streaming numbers. Visualization through ggplot2 enabled clear depiction of critical patterns, such as the distribution of Spotify streams by genre and the correlation between artist popularity and streaming figures.
Overall, the analysis not only strengthened the understanding of music streaming behavior in 2024 but also highlighted the importance of cross-platform dynamics and multi-faceted engagement metrics in evaluating musical success. These insights can assist record labels, artists, and marketers in developing more data-driven strategies for promoting music and optimizing audience reach across digital platforms.