PROJECT OVERVIEW: This project explores the “Most Streamed Spotify Songs of 2024” dataset to uncover trends and patterns in music streaming across platforms such as Spotify, YouTube, TikTok, and SoundCloud. Because music consumption changes quickly, the dataset helps show which songs and artists dominate across platforms and which factors are most associated with their popularity. The dataset includes metrics such as stream counts, playlist counts and reach, social engagement, and an all-time rank, along with metadata such as track and album names, artists, release dates, and ISRC codes.

Objectives:

1. Data Cleaning and Preprocessing: To systematically clean and preprocess the dataset by addressing missing values, removing duplicate entries, and ensuring that all data types are correctly formatted for accurate analysis.

2. Exploratory Data Analysis: To explore and analyze key streaming metrics, including Spotify streams, YouTube views, and TikTok engagement, in order to identify patterns, trends, and insights related to song performance.

3. Artist and Record Label Evaluation: To assess the influence of artists and record labels by identifying the most successful contributors based on streaming volume, social media presence, and overall impact on the music landscape.

4. Cross-Platform Performance Comparison: To compare performance metrics across multiple platforms and investigate the relationships and correlations between them, particularly focusing on the connection between YouTube views and Spotify streams.

5. Identification of Top and Consistent Performers: To identify top-ranking songs and those that consistently trend across different platforms over time, highlighting tracks that maintain popularity and engagement.

6. Aggregation and Summarization of Streaming Data: To aggregate and summarize streaming data by artist and platform, providing a comprehensive understanding of audience engagement, content reach, and market dynamics.

7. Feature Engineering for Enhanced Insights: To create meaningful new features, such as composite popularity scores, engagement ratios, and platform diversity metrics, thereby enriching the dataset for deeper insights and advanced analytical modeling.

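#Loading the data set

library(tidyverse)  # provides read_csv(), the dplyr verbs (%>%, filter, count, ...) and ggplot2 used below
data <- read_csv("C:/Users/ACER/Desktop/Most Streamed Spotify Songs 2024.csv")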
## Rows: 4600 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Track, Album Name, Artist, Release Date, ISRC
## dbl  (6): Track Score, Spotify Popularity, Apple Music Playlist Count, Deeze...
## num (17): All Time Rank, Spotify Streams, Spotify Playlist Count, Spotify Pl...
## lgl  (1): TIDAL Popularity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#1.Understanding the data set 

# Remove columns that contain all NA values
data <- data[, colSums(is.na(data)) < nrow(data)]
# Fill NA values in numeric columns with their respective mean values
numeric_columns <- names(data)[sapply(data, is.numeric)]
for (col in numeric_columns) {
  data[[col]][is.na(data[[col]])] <- mean(data[[col]], na.rm = TRUE)
}
# Check if missing values have been handled
missing_values <- colSums(is.na(data))
print("Missing values per column after handling:")
## [1] "Missing values per column after handling:"
print(missing_values)
##                      Track                 Album Name 
##                          0                          0 
##                     Artist               Release Date 
##                          5                          0 
##                       ISRC              All Time Rank 
##                          0                          0 
##                Track Score            Spotify Streams 
##                          0                          0 
##     Spotify Playlist Count     Spotify Playlist Reach 
##                          0                          0 
##         Spotify Popularity              YouTube Views 
##                          0                          0 
##              YouTube Likes               TikTok Posts 
##                          0                          0 
##               TikTok Likes               TikTok Views 
##                          0                          0 
##     YouTube Playlist Reach Apple Music Playlist Count 
##                          0                          0 
##              AirPlay Spins             SiriusXM Spins 
##                          0                          0 
##      Deezer Playlist Count      Deezer Playlist Reach 
##                          0                          0 
##      Amazon Playlist Count            Pandora Streams 
##                          0                          0 
##     Pandora Track Stations         Soundcloud Streams 
##                          0                          0 
##              Shazam Counts             Explicit Track 
##                          0                          0
#a.Column names are very important as they provide context and meaning to the data contained within each column.
print(colnames(data))
##  [1] "Track"                      "Album Name"                
##  [3] "Artist"                     "Release Date"              
##  [5] "ISRC"                       "All Time Rank"             
##  [7] "Track Score"                "Spotify Streams"           
##  [9] "Spotify Playlist Count"     "Spotify Playlist Reach"    
## [11] "Spotify Popularity"         "YouTube Views"             
## [13] "YouTube Likes"              "TikTok Posts"              
## [15] "TikTok Likes"               "TikTok Views"              
## [17] "YouTube Playlist Reach"     "Apple Music Playlist Count"
## [19] "AirPlay Spins"              "SiriusXM Spins"            
## [21] "Deezer Playlist Count"      "Deezer Playlist Reach"     
## [23] "Amazon Playlist Count"      "Pandora Streams"           
## [25] "Pandora Track Stations"     "Soundcloud Streams"        
## [27] "Shazam Counts"              "Explicit Track"

Interpretation: The dataset contains multiple metrics measuring a song’s performance across different platforms, along with metadata such as track, album, artist, release date, and ISRC. It has no genre or record label column.

#b.Are there any missing or duplicate values in the dataset? 
# Check for missing values
missing_values <- colSums(is.na(data))
print(missing_values)
##                      Track                 Album Name 
##                          0                          0 
##                     Artist               Release Date 
##                          5                          0 
##                       ISRC              All Time Rank 
##                          0                          0 
##                Track Score            Spotify Streams 
##                          0                          0 
##     Spotify Playlist Count     Spotify Playlist Reach 
##                          0                          0 
##         Spotify Popularity              YouTube Views 
##                          0                          0 
##              YouTube Likes               TikTok Posts 
##                          0                          0 
##               TikTok Likes               TikTok Views 
##                          0                          0 
##     YouTube Playlist Reach Apple Music Playlist Count 
##                          0                          0 
##              AirPlay Spins             SiriusXM Spins 
##                          0                          0 
##      Deezer Playlist Count      Deezer Playlist Reach 
##                          0                          0 
##      Amazon Playlist Count            Pandora Streams 
##                          0                          0 
##     Pandora Track Stations         Soundcloud Streams 
##                          0                          0 
##              Shazam Counts             Explicit Track 
##                          0                          0
# Check for duplicate rows
duplicate_rows <- sum(duplicated(data))
print(paste("Number of duplicate rows:", duplicate_rows))
## [1] "Number of duplicate rows: 2"

Interpretation: Missing values in numeric columns were filled with the column mean, but 5 missing values remain in `Artist` because that column is not numeric. Two duplicate rows were identified but not removed, so the row counts in later steps still include them.
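The output above still shows 5 missing `Artist` values and 2 duplicate rows. As a minimal sketch of how these could be handled, the snippet below works on a small made-up data frame; the toy values and the "Unknown" placeholder are assumptions for illustration, not part of the original analysis.

```r
# Toy data standing in for the real dataset (values are made up)
toy <- data.frame(
  Track   = c("A", "B", "B", "C"),
  Artist  = c("X", "Y", "Y", NA),
  Streams = c(10, 20, 20, 30),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows, keeping the first occurrence
toy_clean <- toy[!duplicated(toy), ]

# Replace missing artist names with an explicit placeholder (assumed label)
toy_clean$Artist[is.na(toy_clean$Artist)] <- "Unknown"

print(nrow(toy_clean))        # rows left after removing duplicates
print(sum(is.na(toy_clean)))  # remaining missing values
```

Applying the same two steps to `data` would change the row count, so any downstream totals would need to be recomputed.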

#c.Convert the date column to Date format (assuming the column is named 'date')
if ("date" %in% colnames(data)) {
  data$date <- as.Date(data$date, format="%Y-%m-%d")
  
  # Find the time range
  time_range <- range(data$date, na.rm = TRUE)
  print(time_range)
} else {
  print("No 'date' column found in the dataset.")
}
## [1] "No 'date' column found in the dataset."

Interpretation: The check fails because the dataset has no column named `date`. The release date is stored in `Release Date` as a character string in month/day/year format (for example "4/26/2024"), so the time range is not computed in this step.
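The structure output in this report shows `Release Date` stored as text such as "4/26/2024". A hedged sketch of how such strings could be parsed and the time range computed is given below; it uses four sample values only, and the format string `"%m/%d/%Y"` is an assumption based on those samples.

```r
# Sample values in the month/day/year format seen in `Release Date`
release_chr <- c("4/26/2024", "5/4/2024", "3/19/2024", "1/12/2023")

# Parse with an explicit format; strings that do not match become NA
release_date <- as.Date(release_chr, format = "%m/%d/%Y")

# Earliest and latest release dates among the parsed values
time_range <- range(release_date, na.rm = TRUE)
print(time_range)
```

On the full dataset the equivalent call would be `as.Date(data$`Release Date`, format = "%m/%d/%Y")`, followed by a check for NA values to catch rows in a different format.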

#d. Display data types and structure of the dataset
print("Dataset structure:")
## [1] "Dataset structure:"
str(data)
## tibble [4,600 × 28] (S3: tbl_df/tbl/data.frame)
##  $ Track                     : chr [1:4600] "MILLION DOLLAR BABY" "Not Like Us" "i like the way you kiss me" "Flowers" ...
##  $ Album Name                : chr [1:4600] "Million Dollar Baby - Single" "Not Like Us" "I like the way you kiss me" "Flowers - Single" ...
##  $ Artist                    : chr [1:4600] "Tommy Richman" "Kendrick Lamar" "Artemas" "Miley Cyrus" ...
##  $ Release Date              : chr [1:4600] "4/26/2024" "5/4/2024" "3/19/2024" "1/12/2023" ...
##  $ ISRC                      : chr [1:4600] "QM24S2402528" "USUG12400910" "QZJ842400387" "USSM12209777" ...
##  $ All Time Rank             : num [1:4600] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Track Score               : num [1:4600] 725 546 538 445 423 ...
##  $ Spotify Streams           : num [1:4600] 3.90e+08 3.24e+08 6.01e+08 2.03e+09 1.07e+08 ...
##  $ Spotify Playlist Count    : num [1:4600] 30716 28113 54331 269802 7223 ...
##  $ Spotify Playlist Reach    : num [1:4600] 1.97e+08 1.75e+08 2.12e+08 1.37e+08 1.51e+08 ...
##  $ Spotify Popularity        : num [1:4600] 92 92 92 85 88 ...
##  $ YouTube Views             : num [1:4600] 8.43e+07 1.16e+08 1.23e+08 1.10e+09 7.74e+07 ...
##  $ YouTube Likes             : num [1:4600] 1713126 3486739 2228730 10629796 3670188 ...
##  $ TikTok Posts              : num [1:4600] 5767700 674700 3025400 7189811 16400 ...
##  $ TikTok Likes              : num [1:4600] 6.52e+08 3.52e+07 2.75e+08 1.08e+09 1.13e+08 ...
##  $ TikTok Views              : num [1:4600] 5.33e+09 2.08e+08 3.37e+09 1.46e+10 1.16e+09 ...
##  $ YouTube Playlist Reach    : num [1:4600] 1.51e+08 1.56e+08 3.74e+08 3.35e+09 1.13e+08 ...
##  $ Apple Music Playlist Count: num [1:4600] 210 188 190 394 182 ...
##  $ AirPlay Spins             : num [1:4600] 40975 40778 74333 1474799 12185 ...
##  $ SiriusXM Spins            : num [1:4600] 684 3 536 2182 1 ...
##  $ Deezer Playlist Count     : num [1:4600] 62 67 136 264 82 ...
##  $ Deezer Playlist Reach     : num [1:4600] 17598718 10422430 36321847 24684248 17660624 ...
##  $ Amazon Playlist Count     : num [1:4600] 114 111 172 210 105 ...
##  $ Pandora Streams           : num [1:4600] 1.80e+07 7.78e+06 5.02e+06 1.90e+08 4.49e+06 ...
##  $ Pandora Track Stations    : num [1:4600] 22931 28444 5639 203384 7006 ...
##  $ Soundcloud Streams        : num [1:4600] 4818457 6623075 7208651 14847968 207179 ...
##  $ Shazam Counts             : num [1:4600] 2669262 1118279 5285340 11822942 457017 ...
##  $ Explicit Track            : num [1:4600] 0 1 0 0 1 1 0 1 1 1 ...
#Identify and apply necessary data transformations
print("Applying necessary transformations...")
## [1] "Applying necessary transformations..."
# Convert character columns that should be categorical (factor)
categorical_columns <- c("artist", "genre", "region", "label")
for (col in categorical_columns) {
  if (col %in% colnames(data)) {
    data[[col]] <- as.factor(data[[col]])
    print(paste("Converted", col, "to factor (categorical)."))
  }
}
# Convert numeric columns stored as characters to numeric
numeric_columns <- c("streams", "daily_streams", "social_mentions", "artist_popularity")
for (col in numeric_columns) {
  if (col %in% colnames(data)) {
    data[[col]] <- as.numeric(gsub(",", "", data[[col]]))  # Remove commas before conversion
    print(paste("Converted", col, "to numeric."))
  }
}
# Final check on dataset structure after transformations
print("Updated dataset structure:")
## [1] "Updated dataset structure:"
str(data)
## tibble [4,600 × 28] (S3: tbl_df/tbl/data.frame)
##  $ Track                     : chr [1:4600] "MILLION DOLLAR BABY" "Not Like Us" "i like the way you kiss me" "Flowers" ...
##  $ Album Name                : chr [1:4600] "Million Dollar Baby - Single" "Not Like Us" "I like the way you kiss me" "Flowers - Single" ...
##  $ Artist                    : chr [1:4600] "Tommy Richman" "Kendrick Lamar" "Artemas" "Miley Cyrus" ...
##  $ Release Date              : chr [1:4600] "4/26/2024" "5/4/2024" "3/19/2024" "1/12/2023" ...
##  $ ISRC                      : chr [1:4600] "QM24S2402528" "USUG12400910" "QZJ842400387" "USSM12209777" ...
##  $ All Time Rank             : num [1:4600] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Track Score               : num [1:4600] 725 546 538 445 423 ...
##  $ Spotify Streams           : num [1:4600] 3.90e+08 3.24e+08 6.01e+08 2.03e+09 1.07e+08 ...
##  $ Spotify Playlist Count    : num [1:4600] 30716 28113 54331 269802 7223 ...
##  $ Spotify Playlist Reach    : num [1:4600] 1.97e+08 1.75e+08 2.12e+08 1.37e+08 1.51e+08 ...
##  $ Spotify Popularity        : num [1:4600] 92 92 92 85 88 ...
##  $ YouTube Views             : num [1:4600] 8.43e+07 1.16e+08 1.23e+08 1.10e+09 7.74e+07 ...
##  $ YouTube Likes             : num [1:4600] 1713126 3486739 2228730 10629796 3670188 ...
##  $ TikTok Posts              : num [1:4600] 5767700 674700 3025400 7189811 16400 ...
##  $ TikTok Likes              : num [1:4600] 6.52e+08 3.52e+07 2.75e+08 1.08e+09 1.13e+08 ...
##  $ TikTok Views              : num [1:4600] 5.33e+09 2.08e+08 3.37e+09 1.46e+10 1.16e+09 ...
##  $ YouTube Playlist Reach    : num [1:4600] 1.51e+08 1.56e+08 3.74e+08 3.35e+09 1.13e+08 ...
##  $ Apple Music Playlist Count: num [1:4600] 210 188 190 394 182 ...
##  $ AirPlay Spins             : num [1:4600] 40975 40778 74333 1474799 12185 ...
##  $ SiriusXM Spins            : num [1:4600] 684 3 536 2182 1 ...
##  $ Deezer Playlist Count     : num [1:4600] 62 67 136 264 82 ...
##  $ Deezer Playlist Reach     : num [1:4600] 17598718 10422430 36321847 24684248 17660624 ...
##  $ Amazon Playlist Count     : num [1:4600] 114 111 172 210 105 ...
##  $ Pandora Streams           : num [1:4600] 1.80e+07 7.78e+06 5.02e+06 1.90e+08 4.49e+06 ...
##  $ Pandora Track Stations    : num [1:4600] 22931 28444 5639 203384 7006 ...
##  $ Soundcloud Streams        : num [1:4600] 4818457 6623075 7208651 14847968 207179 ...
##  $ Shazam Counts             : num [1:4600] 2669262 1118279 5285340 11822942 457017 ...
##  $ Explicit Track            : num [1:4600] 0 1 0 0 1 1 0 1 1 1 ...

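Interpretation: Proper data types are essential for accurate analysis, but this step changes nothing here. Column names are case-sensitive, and none of the listed names ("artist", "genre", "streams", and so on) match the dataset's columns (`Artist`, `Spotify Streams`, and so on), so no conversion message is printed and the structure is identical before and after.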
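Column lookups with `%in%` are case-sensitive, so a name like "artist" does not match `Artist`. The sketch below shows one way to convert types using exact column names and to report names that are not found. The toy data frame and the factor labels are assumptions for illustration.

```r
# Toy data using the dataset's actual column spellings (values are made up)
toy <- data.frame(
  Artist = c("Tommy Richman", "Kendrick Lamar", "Kendrick Lamar"),
  `Explicit Track` = c(0, 1, 1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns we would like to treat as categorical; "genre" is deliberately absent
wanted <- c("Artist", "genre")

# Report requested columns that do not exist instead of skipping them silently
missing_cols <- setdiff(wanted, colnames(toy))
if (length(missing_cols) > 0) {
  message("Columns not found: ", paste(missing_cols, collapse = ", "))
}

# Convert the columns that do exist
for (col in intersect(wanted, colnames(toy))) {
  toy[[col]] <- as.factor(toy[[col]])
}

# Turn the 0/1 flag into a labelled factor (labels are an assumed convention)
toy$`Explicit Track` <- factor(toy$`Explicit Track`,
                               levels = c(0, 1),
                               labels = c("Non-explicit", "Explicit"))
str(toy)
```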

# 2. Data Extraction & Filtering
#(a) How many songs have more than 100 million streams?
# Filter songs with more than 100 million streams
high_stream_songs <- data %>% filter(`Spotify Streams` > 100000000)
# Count the number of such songs
num_high_stream_songs <- nrow(high_stream_songs)
# Print the result
print(paste("Number of songs with more than 100 million streams:", num_high_stream_songs))
## [1] "Number of songs with more than 100 million streams: 3226"

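Interpretation: 3,226 of the 4,600 tracks (about 70%) exceed 100 million Spotify streams, so this threshold describes most of the dataset rather than a small set of viral hits. The count also includes any tracks whose missing stream counts were filled with the column mean (about 447 million).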

#(b) Which artist has the most songs in the top 100 streamed songs?
# Select the top 100 streamed songs
top_100_songs <- data %>% arrange(desc(`Spotify Streams`)) %>% head(100)
# Count the number of songs per artist
artist_song_count <- top_100_songs %>% count(Artist, sort = TRUE)
# Print the artist with the most songs
top_artist <- artist_song_count %>% slice(1)
print(top_artist)
## # A tibble: 1 × 2
##   Artist         n
##   <chr>      <int>
## 1 Bruno Mars     4

Interpretation: Bruno Mars has the most tracks (4) among the 100 most-streamed songs on Spotify. Because `slice(1)` returns a single row, any other artist tied at 4 would not be shown.

#(c) What percentage of songs belong to the top 5 record labels?
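# Count the number of songs per record label
# NOTE: 'Record Label' in quotes is a string constant, not a column reference,
# so count() places every row in a single group
label_song_count <- data %>% count('Record Label', sort = TRUE)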
# Get the top 5 record labels
top_5_labels <- label_song_count %>% head(5)

# Calculate the percentage of songs from these labels
total_songs <- nrow(data)
top_5_percentage <- sum(top_5_labels$n) / total_songs * 100
# Print the result
print(paste("Percentage of songs from top 5 record labels:", round(top_5_percentage, 2), "%"))
## [1] "Percentage of songs from top 5 record labels: 100 %"

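Interpretation: The 100% figure is an artifact, not a finding. `'Record Label'` in quotes is treated as a constant string, so `count()` puts every row into one group. The dataset also has no record label column (see the column list in section 1), so this question cannot be answered from this data.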
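The effect of quoting a name inside `count()` can be reproduced on a toy tibble. The sketch below assumes the `dplyr` package is installed; the `label` column and its values are hypothetical.

```r
library(dplyr)

# Hypothetical label column with three distinct values
toy <- tibble(label = c("A", "A", "B", "C"))

# Quoted string: dplyr counts a constant, so there is a single group
quoted <- toy %>% count("label")

# Bare column name: one row per distinct label, largest first
by_column <- toy %>% count(label, sort = TRUE)

print(quoted)
print(by_column)
```

With the quoted form, the "top 5" groups always cover all rows, which is how a share of 100% arises.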

#(d) How do Spotify Streams compare to YouTube Views? (Are songs with high Spotify streams also popular on YouTube?)

# Compute correlation between Spotify Streams and YouTube Views
correlation <- cor(data$`Spotify Streams`, data$`YouTube Views`, use = "complete.obs")
print(paste("Correlation between Spotify Streams and YouTube Views:", round(correlation, 2)))
## [1] "Correlation between Spotify Streams and YouTube Views: 0.43"

Interpretation: The correlation of 0.43 is moderate and positive (r² ≈ 0.18). Songs with more Spotify streams tend to have more YouTube views, but most of the variation is not shared, which points to substantial platform-specific behavior.
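Pearson correlation on raw counts can be sensitive to a few very large values. As a rough sketch, the snippet below compares Pearson on raw values, Pearson on log-transformed values, and the rank-based Spearman coefficient. The data are simulated from log-normal distributions with arbitrary parameters; they are assumptions, not the report's data.

```r
set.seed(42)

# Simulated right-skewed counts (not real data)
spotify <- rlnorm(500, meanlog = 18, sdlog = 1.5)
youtube <- spotify^0.6 * rlnorm(500, meanlog = 5, sdlog = 1)

# Three ways to summarise the association
pearson_raw <- cor(spotify, youtube, method = "pearson")
pearson_log <- cor(log10(spotify), log10(youtube), method = "pearson")
spearman    <- cor(spotify, youtube, method = "spearman")

print(round(c(pearson_raw = pearson_raw,
              pearson_log = pearson_log,
              spearman    = spearman), 2))
```

Spearman depends only on ranks, so it is unchanged by the log transform; a large gap between it and the raw Pearson value would suggest that outliers are influencing the latter.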

#3. Grouping & Summarisation
#(a)What is the average number of streams per song? 
# Calculate the average number of streams per song
avg_streams <- mean(data$`Spotify Streams`, na.rm = TRUE)

# Print the result
print(paste("Average number of streams per song:", round(avg_streams, 2)))
## [1] "Average number of streams per song: 447387314.75"

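Interpretation: The average is about 447 million streams per song, which provides a benchmark for whether a track outperforms the norm. Stream counts are typically right-skewed, so the mean can sit well above a typical track; the median is a useful complementary benchmark.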
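To illustrate why a mean can be a misleading benchmark for skewed counts, the sketch below compares the mean and median of five made-up stream counts, one of which is a very large hit. The numbers are invented for the example.

```r
# Made-up stream counts: four modest tracks and one very large hit
streams <- c(2e6, 5e6, 8e6, 1.2e7, 4e9)

mean_streams   <- mean(streams)
median_streams <- median(streams)

# Share of tracks that fall below the mean
share_below_mean <- mean(streams < mean_streams)

print(mean_streams)
print(median_streams)
print(share_below_mean)
```

Here a single track pulls the mean far above four of the five values, while the median stays close to a typical track.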

#(b).Which platform contributes the most to the overall track popularity?
# Sum engagement by platform
platform_totals <- data.frame(
  Platform = c("Spotify", "YouTube", "TikTok", "Soundcloud", "Pandora"),
  Total_Engagement = c(
    sum(data$`Spotify Streams`, na.rm = TRUE),
    sum(data$`YouTube Views`, na.rm = TRUE),
    sum(data$`TikTok Views`, na.rm = TRUE),
    sum(data$`Soundcloud Streams`, na.rm = TRUE),
    sum(data$`Pandora Streams`, na.rm = TRUE)
  )
)
# Sort by total engagement
platform_totals <- platform_totals %>% arrange(desc(Total_Engagement))
print(platform_totals)
##     Platform Total_Engagement
## 1     TikTok     5.341326e+12
## 2    Spotify     2.057982e+12
## 3    YouTube     1.852865e+12
## 4    Pandora     3.940698e+11
## 5 Soundcloud     6.830065e+10

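Interpretation: TikTok has the largest total (about 5.3 trillion views), followed by Spotify (about 2.1 trillion streams) and YouTube (about 1.9 trillion views). These totals mix different units (views versus streams), so they indicate volume of activity rather than a direct measure of influence.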

#(c)Which artist has the highest combined social media reach (YouTube + TikTok + Spotify)?
# Calculate total social media reach
data <- data %>%
  mutate(Social_Reach = `YouTube Views` + `TikTok Views` + `Spotify Playlist Reach`)

# Group by artist and sum their total reach
artist_reach <- data %>%
  group_by(Artist) %>%
  summarise(Total_Reach = sum(Social_Reach, na.rm = TRUE)) %>%
  arrange(desc(Total_Reach))

# Display the artist with the highest total social reach
top_social_artist <- artist_reach %>% slice(1)
print(top_social_artist)
## # A tibble: 1 × 2
##   Artist         Total_Reach
##   <chr>                <dbl>
## 1 Kevin MacLeod 233245259334

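Interpretation: Kevin MacLeod has the highest combined reach (about 233 billion). The sum adds raw YouTube views, TikTok views, and Spotify playlist reach, so the metric with the largest values dominates; a high total reflects heavy use on at least one platform rather than a strong presence on all three.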

#(d) What is the total number of streams per artist?
# Summarize total streams by artist
artist_streams <- data %>% group_by(Artist) %>% summarize(total_streams = sum(`Spotify Streams`, na.rm = TRUE))

# Print the result
print(artist_streams)
## # A tibble: 2,000 × 2
##    Artist               total_streams
##    <chr>                        <dbl>
##  1 "\"XY\""                447387315.
##  2 "$OHO BANI"              54065563 
##  3 "$uicideboy$"          1697447430 
##  4 "&ME"                    34601626 
##  5 "(G)I-DLE"              876938452 
##  6 "*NSYNC"                 69041864 
##  7 ".diedlonely"            44235866 
##  8 "10CM"                   18800716 
##  9 "13 Organis\xef\xbf"    267241905 
## 10 "1da Banton"            154010050 
## # ℹ 1,990 more rows

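Interpretation: An artist's total reflects both the number of tracks in the dataset and the size of each hit. Totals can include mean-imputed values: the 447,387,315 shown for "XY" matches the column mean reported in 3(a), which suggests an imputed value rather than a real count.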

#4. Sorting and Ranking Data

#(a)Which songs stayed in the top 10 "All Time Rank" the longest?
# Count occurrences of songs in the top 10 of "All Time Rank"
top_10_songs <- data %>%
  filter(`All Time Rank` <= 10) %>%
  group_by(Track) %>%
  summarise(Top10_Appearances = n()) %>%
  arrange(desc(Top10_Appearances))

# Print the result
print(top_10_songs)
## # A tibble: 10 × 2
##    Track                      Top10_Appearances
##    <chr>                                  <int>
##  1 BAND4BAND (feat. Lil Baby)                 1
##  2 Beautiful Things                           1
##  3 Danza Kuduro - Cover                       1
##  4 Flowers                                    1
##  5 Gata Only                                  1
##  6 Houdini                                    1
##  7 Lovin On Me                                1
##  8 MILLION DOLLAR BABY                        1
##  9 Not Like Us                                1
## 10 i like the way you kiss me                 1

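Interpretation: Each track appears exactly once, so every count is 1. The dataset is a single snapshot with one `All Time Rank` per track, so it cannot show how long a song stayed in the top 10. The result simply lists the ten tracks ranked 1 to 10.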

#(b)What is the correlation between YouTube Views and Spotify Streams?
# Calculate correlation between YouTube Views and Spotify Streams
correlation <- cor(data$`YouTube Views`, data$`Spotify Streams`, use = "complete.obs")
# Print correlation result
print(paste("Correlation between YouTube Views and Spotify Streams:", round(correlation, 2)))
## [1] "Correlation between YouTube Views and Spotify Streams: 0.43"

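Interpretation: This repeats the calculation from 2(d) with the arguments swapped; correlation is symmetric, so the result is the same 0.43. The moderate association means one platform's numbers give only a rough guide to performance on the other.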

#(c)Which song had the most consistent ranking across multiple platforms?
# Select ranking-related columns (adjust column names based on dataset)
ranking_columns <- c("Spotify Popularity", "YouTube Views", "TikTok Likes", "Apple Music Playlist Count")

# Calculate variance in rankings for each song
ranking_variability <- data %>%
  rowwise() %>%
  mutate(Rank_Variance = var(c_across(all_of(ranking_columns)), na.rm = TRUE)) %>%
  select(Track, Rank_Variance) %>%
  arrange(Rank_Variance)

# Print the song with the most consistent ranking (lowest variance)
print(ranking_variability %>% head(10))
## # A tibble: 10 × 2
## # Rowwise: 
##    Track                            Rank_Variance
##    <chr>                                    <dbl>
##  1 "Straight and Narrow"              5803046608.
##  2 "She's Gone, Da"                   7320406633.
##  3 "Talk talk"                        9819019840.
##  4 "Magic Johnson"                   13244746812.
##  5 "BBY BOO (REMIX)"                101036019237.
##  6 "I'm A Star - From \"Wish\""     105663385861.
##  7 "Atmosphere"                     133259941130 
##  8 "High Road (feat. Jessie Murph)" 136332811523.
##  9 "Solid (feat. Drake)"            217338537016 
## 10 "Give It To Me - Full Vocal Mix" 229144336433.

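Interpretation: The selected columns are raw values on very different scales (a popularity score on a 0-100 scale next to view and like counts in the millions or billions), not rankings, so the variance mostly reflects the size of the largest metric. Low variance here identifies tracks with small raw numbers, not tracks that perform equally well across platforms.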
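One way to measure consistency on a common scale is to rank tracks within each platform first and then look at the spread of each track's ranks. The sketch below is an illustration under stated assumptions: the toy data frame, its column names, and the use of the standard deviation of ranks as the consistency measure are all choices made for this example.

```r
# Toy metrics for four tracks (values are made up)
toy <- data.frame(
  Track   = c("A", "B", "C", "D"),
  spotify = c(100, 400, 300, 200),
  youtube = c(10, 40, 30, 25),
  tiktok  = c(4000, 2000, 3000, 1000)
)
metrics <- c("spotify", "youtube", "tiktok")

# Rank within each platform (1 = highest value on that platform)
ranks <- sapply(toy[metrics], function(x) rank(-x))

# Spread of each track's ranks across platforms (0 = same rank everywhere)
toy$rank_sd <- apply(ranks, 1, sd)

# Most consistent tracks first
print(toy[order(toy$rank_sd), c("Track", "rank_sd")])
```

Because ranks are unit-free, a platform with billions of views no longer outweighs a 0-100 score.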

#(d)Which songs were trending on both TikTok and Spotify simultaneously?
# Define the threshold for top 10% in both TikTok and Spotify
spotify_threshold <- quantile(data$`Spotify Streams`, 0.90, na.rm = TRUE)
tiktok_threshold <- quantile(data$`TikTok Posts`, 0.90, na.rm = TRUE)

# Filter songs that meet both conditions
trending_songs <- data %>%
  filter(`Spotify Streams` >= spotify_threshold & `TikTok Posts` >= tiktok_threshold) %>%
  select(Track, Artist, `Spotify Streams`, `TikTok Posts`)

# Print the result
print(trending_songs)
## # A tibble: 79 × 4
##    Track                     Artist        `Spotify Streams` `TikTok Posts`
##    <chr>                     <chr>                     <dbl>          <dbl>
##  1 Flowers                   Miley Cyrus          2031280633        7189811
##  2 greedy                    Tate McRae           1258569694        2294429
##  3 As It Was                 Harry Styles         3301814535        2755903
##  4 STAY (with Justin Bieber) The Kid LAROI        3107100349        7485966
##  5 Dance Monkey              Tones And I          3071214106       10342366
##  6 Shape of You              Ed Sheeran           3909458734        2270315
##  7 Blinding Lights           The Weeknd           4281468720        2882064
##  8 Unholy (feat. Kim Petras) Sam Smith            1556275789        2379787
##  9 Me Porto Bonito           Bad Bunny            1811990630        4506600
## 10 Perfect                   Ed Sheeran           2969999682        6642975
## # ℹ 69 more rows

Interpretation: These 79 tracks are in the top 10% for both Spotify streams and TikTok posts, so they are strong on both user-generated content (TikTok) and listening (Spotify).

# 5. Feature Engineering

# (a). Composite Popularity Score (standardized sum of key metrics)
data <- data %>%
  mutate(Composite_Popularity_Score = scale(`Spotify Streams`) +
           scale(`YouTube Views`) +
           scale(`TikTok Views`) +
           scale(`Spotify Playlist Count`) +
           scale(`Spotify Popularity`) +
           scale(`Shazam Counts`))




# Print summary of new composite score
print(summary(data$Composite_Popularity_Score))
##        V1         
##  Min.   :-7.0735  
##  1st Qu.:-2.1585  
##  Median :-0.9755  
##  Mean   : 0.0000  
##  3rd Qu.: 1.2137  
##  Max.   :40.2469

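Interpretation: The composite score sums the z-scores of six metrics, so each metric contributes on a comparable scale and the mean is 0 by construction. The maximum (40.2) is far above the third quartile (1.2), which points to a few extreme outliers. Because `scale()` returns a one-column matrix, the score is stored as a matrix column, which is why the summary is labelled `V1`.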
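`scale()` returns a one-column matrix, which is why the summary above carries a `V1` header. A minimal sketch of a plain numeric alternative is shown below; the helper name `z` and the toy columns are hypothetical.

```r
# Toy metrics on very different scales (values are made up)
toy <- data.frame(
  streams = c(1e6, 5e6, 2e7, 1e9),
  views   = c(3e5, 8e6, 1e7, 5e8)
)

# Hypothetical helper: z-score as a plain numeric vector (drops matrix attributes)
z <- function(x) as.numeric(scale(x))

# Composite score as a sum of z-scores
toy$score <- z(toy$streams) + z(toy$views)

str(toy$score)
print(summary(toy$score))
```

Storing the score as a plain vector avoids the extra `scaled:center` and `scaled:scale` attributes and keeps the column compatible with functions that expect an ordinary numeric column.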

# (b) Engagement Ratios
data <- data %>%
  mutate(
    Views_per_Playlist = ifelse(`YouTube Playlist Reach` > 0,
                                `YouTube Views` / `YouTube Playlist Reach`, NA),
    TikTok_Engagement_Rate = ifelse(`TikTok Posts` > 0,
                                    `TikTok Likes` / `TikTok Posts`, NA),
    Spotify_Reach_per_Stream = ifelse(`Spotify Streams` > 0,
                                      `Spotify Playlist Reach` / `Spotify Streams`, NA)
  )

# Print example rows for engagement ratios
print("First 5 rows of Engagement Ratios:")
## [1] "First 5 rows of Engagement Ratios:"
print(data %>% select(Views_per_Playlist, TikTok_Engagement_Rate, Spotify_Reach_per_Stream) %>% head(5))
## # A tibble: 5 × 3
##   Views_per_Playlist TikTok_Engagement_Rate Spotify_Reach_per_Stream
##                <dbl>                  <dbl>                    <dbl>
## 1              0.560                  113.                    0.504 
## 2              0.744                   52.2                   0.539 
## 3              0.328                   90.9                   0.352 
## 4              0.327                  150.                    0.0672
## 5              0.686                 6868.                    1.42

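Interpretation: Ratios such as YouTube views per unit of playlist reach and TikTok likes per post measure how efficiently a track converts exposure into interaction, rather than raw volume. Rows where the denominator is 0 are set to NA to avoid division by zero.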

# (c). Cross-Platform Popularity
data <- data %>%
  mutate(Cross_Platform_Popularity = `Spotify Streams` + `YouTube Views` + `TikTok Views` + `Pandora Streams` + `Soundcloud Streams`)

# Print top songs by Cross-Platform Popularity
top_cross_platform_songs <- data %>%
  arrange(desc(Cross_Platform_Popularity)) %>%
  select(Track, Artist, Cross_Platform_Popularity)
print("Top songs by Cross-Platform Popularity:")
## [1] "Top songs by Cross-Platform Popularity:"
print(head(top_cross_platform_songs, 10))
## # A tibble: 10 × 3
##    Track                     Artist                   Cross_Platform_Popularity
##    <chr>                     <chr>                                        <dbl>
##  1 Monkeys Spinning Monkeys  Kevin MacLeod                        233355761424.
##  2 Love You So               The King Khan & BBQ Show             214882841953.
##  3 Oh No                     Kreepa                                61162058226.
##  4 Funny Song                Cavendish Music                       38406176345.
##  5 Aesthetic                 Tollan Kim                            33894494044.
##  6 Spongebob                 Dante9k                               33777959787.
##  7 She Share Story           Shayne Orok                           33664390781.
##  8 STAY (with Justin Bieber) The Kid LAROI                         28309576032 
##  9 Pieces                    Danilo Stankovic                      28138961047.
## 10 love nwantiti (ah ah ah)  CKay                                  25965134049.
# Final structure of the dataset
print("Updated structure of dataset:")
## [1] "Updated structure of dataset:"
str(data)
## tibble [4,600 × 34] (S3: tbl_df/tbl/data.frame)
##  $ Track                     : chr [1:4600] "MILLION DOLLAR BABY" "Not Like Us" "i like the way you kiss me" "Flowers" ...
##  $ Album Name                : chr [1:4600] "Million Dollar Baby - Single" "Not Like Us" "I like the way you kiss me" "Flowers - Single" ...
##  $ Artist                    : chr [1:4600] "Tommy Richman" "Kendrick Lamar" "Artemas" "Miley Cyrus" ...
##  $ Release Date              : chr [1:4600] "4/26/2024" "5/4/2024" "3/19/2024" "1/12/2023" ...
##  $ ISRC                      : chr [1:4600] "QM24S2402528" "USUG12400910" "QZJ842400387" "USSM12209777" ...
##  $ All Time Rank             : num [1:4600] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Track Score               : num [1:4600] 725 546 538 445 423 ...
##  $ Spotify Streams           : num [1:4600] 3.90e+08 3.24e+08 6.01e+08 2.03e+09 1.07e+08 ...
##  $ Spotify Playlist Count    : num [1:4600] 30716 28113 54331 269802 7223 ...
##  $ Spotify Playlist Reach    : num [1:4600] 1.97e+08 1.75e+08 2.12e+08 1.37e+08 1.51e+08 ...
##  $ Spotify Popularity        : num [1:4600] 92 92 92 85 88 ...
##  $ YouTube Views             : num [1:4600] 8.43e+07 1.16e+08 1.23e+08 1.10e+09 7.74e+07 ...
##  $ YouTube Likes             : num [1:4600] 1713126 3486739 2228730 10629796 3670188 ...
##  $ TikTok Posts              : num [1:4600] 5767700 674700 3025400 7189811 16400 ...
##  $ TikTok Likes              : num [1:4600] 6.52e+08 3.52e+07 2.75e+08 1.08e+09 1.13e+08 ...
##  $ TikTok Views              : num [1:4600] 5.33e+09 2.08e+08 3.37e+09 1.46e+10 1.16e+09 ...
##  $ YouTube Playlist Reach    : num [1:4600] 1.51e+08 1.56e+08 3.74e+08 3.35e+09 1.13e+08 ...
##  $ Apple Music Playlist Count: num [1:4600] 210 188 190 394 182 ...
##  $ AirPlay Spins             : num [1:4600] 40975 40778 74333 1474799 12185 ...
##  $ SiriusXM Spins            : num [1:4600] 684 3 536 2182 1 ...
##  $ Deezer Playlist Count     : num [1:4600] 62 67 136 264 82 ...
##  $ Deezer Playlist Reach     : num [1:4600] 17598718 10422430 36321847 24684248 17660624 ...
##  $ Amazon Playlist Count     : num [1:4600] 114 111 172 210 105 ...
##  $ Pandora Streams           : num [1:4600] 1.80e+07 7.78e+06 5.02e+06 1.90e+08 4.49e+06 ...
##  $ Pandora Track Stations    : num [1:4600] 22931 28444 5639 203384 7006 ...
##  $ Soundcloud Streams        : num [1:4600] 4818457 6623075 7208651 14847968 207179 ...
##  $ Shazam Counts             : num [1:4600] 2669262 1118279 5285340 11822942 457017 ...
##  $ Explicit Track            : num [1:4600] 0 1 0 0 1 1 0 1 1 1 ...
##  $ Social_Reach              : num [1:4600] 5.61e+09 4.99e+08 3.70e+09 1.58e+10 1.39e+09 ...
##  $ Composite_Popularity_Score: num [1:4600, 1] 1.78 0.408 2.654 12.667 -0.56 ...
##   ..- attr(*, "scaled:center")= num 4.47e+08
##   ..- attr(*, "scaled:scale")= num 5.32e+08
##  $ Views_per_Playlist        : num [1:4600] 0.56 0.744 0.328 0.327 0.686 ...
##  $ TikTok_Engagement_Rate    : num [1:4600] 113 52.2 90.9 150 6868.1 ...
##  $ Spotify_Reach_per_Stream  : num [1:4600] 0.5036 0.5394 0.3519 0.0672 1.4151 ...
##  $ Cross_Platform_Popularity : num [1:4600] 5.83e+09 6.63e+08 4.11e+09 1.79e+10 1.35e+09 ...

Interpretation: The table highlights the top songs that have achieved the highest combined popularity across major streaming platforms, reflecting strong cross-platform audience engagement and widespread digital reach.

10 data analysis questions

a. Does “Explicit Track” have a significant effect on “YouTube Views”?

Hypotheses:

Null Hypothesis (H₀): There is no significant difference in YouTube Views between Explicit and Non-Explicit tracks.

Alternative Hypothesis (H₁): There is a significant difference in YouTube Views between Explicit and Non-Explicit tracks.

# Ensure "Explicit Track" is a factor (categorical variable)
data$`Explicit Track` <- as.factor(data$`Explicit Track`)

# Perform ANOVA
anova_result <- aov(`YouTube Views` ~ `Explicit Track`, data = data)

# Display the summary of the ANOVA result
summary(anova_result)
##                    Df    Sum Sq   Mean Sq F value Pr(>F)    
## `Explicit Track`    1 4.600e+19 4.600e+19   102.3 <2e-16 ***
## Residuals        4598 2.068e+21 4.498e+17                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# box plot to visualize YouTube Views by Explicit Track category
ggplot(data, aes(x = `Explicit Track`, y = `YouTube Views`, fill = `Explicit Track`)) +
  geom_boxplot() +
  labs(title = "YouTube Views by Explicit Track", 
       x = "Explicit Track", 
       y = "YouTube Views") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "salmon"))  # Customize colors

Interpretation: The p-value is far below 0.05 (F = 102.3, p < 2e-16), so we reject the null hypothesis: mean YouTube Views differ significantly between explicit and non-explicit tracks. This establishes an association between explicit status and views; on its own it does not show that explicit content causes the difference.
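View counts are strongly right-skewed, so it can be worth checking that the ANOVA conclusion does not depend on a few very large tracks. The sketch below is illustrative only: it uses simulated data (the `sim` data frame and its `explicit` and `views` columns are hypothetical stand-ins, not the report's dataset). It shows that a two-group ANOVA is the same test as a pooled-variance t-test, then runs two more robust alternatives.

```r
# Simulated stand-in for the real data: right-skewed "views" for two groups.
set.seed(42)
sim <- data.frame(
  explicit = factor(rep(c(0, 1), times = c(300, 200))),
  views    = c(rlnorm(300, meanlog = 18.0, sdlog = 1.5),
               rlnorm(200, meanlog = 17.5, sdlog = 1.5))
)

# With only two groups, one-way ANOVA is equivalent to a pooled-variance
# t-test: the F statistic equals the squared t statistic.
f_value <- summary(aov(views ~ explicit, data = sim))[[1]][["F value"]][1]
t_value <- unname(t.test(views ~ explicit, data = sim, var.equal = TRUE)$statistic)
all.equal(f_value, t_value^2)

# Welch's t-test on log1p(views): no equal-variance assumption, and the log
# transform reduces the influence of extreme values.
welch_log <- t.test(log1p(views) ~ explicit, data = sim)
welch_log$p.value

# Rank-based alternative that makes no normality assumption.
kruskal_p <- kruskal.test(views ~ explicit, data = sim)$p.value
kruskal_p
```

If all three approaches agree on the real data, the difference between the groups is unlikely to be an artifact of skew or outliers.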

b. What is the relationship between “Spotify Streams” and “YouTube Views”?

Hypothesis: There is a linear relationship between the number of Spotify streams and YouTube views.

# Perform linear regression
linear_model <- lm(`YouTube Views` ~ `Spotify Streams`, data = data)

# Summary of the linear model
summary(linear_model)
## 
## Call:
## lm(formula = `YouTube Views` ~ `Spotify Streams`, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.112e+09 -2.228e+08 -1.466e+08  6.210e+07  1.577e+10 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.551e+08  1.177e+07   13.18   <2e-16 ***
## `Spotify Streams` 5.537e-01  1.694e-02   32.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 610800000 on 4598 degrees of freedom
## Multiple R-squared:  0.1886, Adjusted R-squared:  0.1884 
## F-statistic:  1069 on 1 and 4598 DF,  p-value: < 2.2e-16
# Create a scatter plot with the regression line
ggplot(data, aes(x = `Spotify Streams`, y = `YouTube Views`)) +
  geom_point(color = "blue") +  # Scatter plot points
  geom_smooth(method = "lm", col = "red") +  # Add regression line
  labs(title = "Linear Relationship between Spotify Streams and YouTube Views", 
       x = "Spotify Streams", 
       y = "YouTube Views") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

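Interpretation: Spotify Streams are a statistically significant predictor of YouTube Views (slope ≈ 0.55, p < 2e-16), but the model explains only about 19% of the variance (R² = 0.1886). The relationship is therefore weak, and other factors clearly influence YouTube Views.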

c. Correlation between TikTok views and Spotify streams

# Calculate correlation
correlation_tiktok_streams <- cor(data$`TikTok Views`, data$`Spotify Streams`, use = "complete.obs")
print(paste("Correlation between TikTok Views and Spotify Streams:", round(correlation_tiktok_streams, 2)))
## [1] "Correlation between TikTok Views and Spotify Streams: 0.03"
# Scatter plot to visualize the relationship
ggplot(data, aes(x = `TikTok Views`, y = `Spotify Streams`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Correlation between TikTok Views and Spotify Streams",
       x = "TikTok Views",
       y = "Spotify Streams")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: The Pearson correlation coefficient between TikTok Views and Spotify Streams is 0.03, which is a very weak positive relationship. In this dataset, TikTok view counts have almost no linear association with Spotify streaming numbers.
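Pearson's coefficient only measures linear association and is sensitive to extreme values, which are common in streaming counts. As a hedged cross-check, the sketch below compares Pearson, Spearman, and log-scale Pearson correlations on simulated heavy-tailed data; `tiktok_views` and `spotify_streams` here are hypothetical vectors created for the example, not columns from the report's data.

```r
# Simulated heavy-tailed data sharing a common latent "popularity" factor.
set.seed(1)
n <- 1000
latent <- rnorm(n)
tiktok_views    <- exp(3 * latent + rnorm(n))  # hypothetical, very skewed
spotify_streams <- exp(latent + rnorm(n))      # hypothetical, skewed

# Pearson on the raw scale: dominated by a handful of extreme points.
pearson_raw <- cor(tiktok_views, spotify_streams, method = "pearson")

# Spearman works on ranks, so it captures any monotonic relationship.
spearman <- cor(tiktok_views, spotify_streams, method = "spearman")

# Pearson after a log transform is another common choice for count data.
pearson_log <- cor(log1p(tiktok_views), log1p(spotify_streams))

c(pearson_raw = pearson_raw, spearman = spearman, pearson_log = pearson_log)

# Significance test for the rank correlation.
spearman_test <- cor.test(tiktok_views, spotify_streams,
                          method = "spearman", exact = FALSE)
spearman_test$p.value
```

A rank or log-scale correlation that is clearly larger than the raw Pearson value would suggest a monotonic but non-linear relationship rather than no relationship at all.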

d. Correlation between the number of TikTok posts and TikTok likes

# Calculate correlation
correlation_tiktok_posts_likes <- cor(data$`TikTok Posts`, data$`TikTok Likes`, use = "complete.obs")
print(paste("Correlation between TikTok Posts and TikTok Likes:", round(correlation_tiktok_posts_likes, 2)))
## [1] "Correlation between TikTok Posts and TikTok Likes: 0.5"
# Scatter plot to visualize the relationship
ggplot(data, aes(x = `TikTok Posts`, y = `TikTok Likes`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Correlation between TikTok Posts and TikTok Likes",
       x = "TikTok Posts",
       y = "TikTok Likes")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: The correlation between TikTok Posts and TikTok Likes is 0.5, a moderate positive relationship. Tracks with more posts tend to have more likes, but posts alone leave much of the variation in likes unexplained.

e. How well do YouTube views predict Spotify streams?

# Linear regression model
model_youtube_streams <- lm(`Spotify Streams` ~ `YouTube Views`, data = data)
summary(model_youtube_streams)
## 
## Call:
## lm(formula = `Spotify Streams` ~ `YouTube Views`, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.149e+09 -2.762e+08 -1.467e+08  1.466e+08  3.814e+09 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.102e+08  8.216e+06   37.75   <2e-16 ***
## `YouTube Views` 3.406e-01  1.042e-02   32.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 479100000 on 4598 degrees of freedom
## Multiple R-squared:  0.1886, Adjusted R-squared:  0.1884 
## F-statistic:  1069 on 1 and 4598 DF,  p-value: < 2.2e-16
# Scatter plot with regression line
ggplot(data, aes(x = `YouTube Views`, y = `Spotify Streams`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "YouTube Views Predicting Spotify Streams",
       x = "YouTube Views",
       y = "Spotify Streams")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: YouTube Views are a significant positive predictor of Spotify Streams (p < 0.001). Each additional YouTube view is associated with about 0.34 additional Spotify streams on average. However, YouTube Views explain only 18.86% of the variation in Spotify Streams, so other factors also influence Spotify success. The model is statistically significant but its fit is weak to moderate. Its R² is identical to that of the regression in question b because both models use the same pair of variables.
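In simple linear regression, swapping the response and the predictor changes the slope but not R²: the product of the two slopes equals the squared Pearson correlation. The sketch below checks this with the slopes printed in the two regression summaries of this report, then confirms the identity on simulated data (`x` and `y` are hypothetical vectors created for the check).

```r
# Slopes copied from the two regression summaries in this report.
slope_youtube_on_spotify <- 0.5537  # lm(`YouTube Views` ~ `Spotify Streams`)
slope_spotify_on_youtube <- 0.3406  # lm(`Spotify Streams` ~ `YouTube Views`)

# Product of the two slopes gives R-squared; its square root gives r.
r_squared <- slope_youtube_on_spotify * slope_spotify_on_youtube
r <- sqrt(r_squared)  # positive root, because both slopes are positive
round(r_squared, 4)   # 0.1886, matching the reported Multiple R-squared
round(r, 2)           # 0.43

# The same identity on simulated data.
set.seed(7)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
b_yx <- unname(coef(lm(y ~ x))[2])
b_xy <- unname(coef(lm(x ~ y))[2])
all.equal(b_yx * b_xy, cor(x, y)^2)
```

The implied correlation of about 0.43 is a compact way to describe the strength of the Spotify and YouTube relationship in either direction.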

f. Effect of the number of playlists on Spotify streams

# Linear regression model
model_playlist_streams <- lm(`Spotify Streams` ~ `Spotify Playlist Count`, data = data)
summary(model_playlist_streams)
## 
## Call:
## lm(formula = `Spotify Streams` ~ `Spotify Playlist Count`, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.052e+09 -1.171e+08 -7.185e+07  4.792e+07  3.814e+09 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              9.055e+07  6.183e+06   14.64   <2e-16 ***
## `Spotify Playlist Count` 6.008e+03  6.703e+01   89.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 320900000 on 4598 degrees of freedom
## Multiple R-squared:  0.636,  Adjusted R-squared:  0.636 
## F-statistic:  8035 on 1 and 4598 DF,  p-value: < 2.2e-16
ggplot(data, aes(x = `Spotify Playlist Count`, y = `Spotify Streams`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Effect of Spotify Playlist Count on Spotify Streams",
       x = "Spotify Playlist Count",
       y = "Spotify Streams")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: The regression analysis shows a strong positive relationship between Spotify Playlist Count and Spotify Streams. Each additional playlist a track appears on is associated with approximately 6,008 more Spotify streams (p < 2e-16). The R-squared value of 0.636 indicates that playlist count explains about 64% of the variation in Spotify Streams, making it by far the strongest single predictor examined in this section.
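One way to keep the written interpretation in step with the model output is to read the slope, its confidence interval, and R² directly from the fitted object. The following is a minimal sketch on simulated data; the `sim` data frame and its `playlist_count` and `streams` columns are hypothetical and only loosely mimic the scale of the real variables.

```r
# Simulated data: streams grow roughly linearly with playlist count.
set.seed(123)
sim <- data.frame(playlist_count = rpois(500, lambda = 50))
sim$streams <- 6000 * sim$playlist_count + rnorm(500, sd = 50000)

fit <- lm(streams ~ playlist_count, data = sim)

# Pull the reported quantities out of the model object.
slope     <- unname(coef(fit)["playlist_count"])
r_squared <- summary(fit)$r.squared
conf_int  <- confint(fit, "playlist_count", level = 0.95)

# Build the interpretation sentence from the fitted values.
sprintf(
  "Each additional playlist is associated with about %.0f more streams (95%% CI %.0f to %.0f, R-squared = %.3f).",
  slope, conf_int[1, 1], conf_int[1, 2], r_squared
)
```

In an R Markdown report the same values can be embedded with inline R expressions, so the text updates whenever the data or model changes.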

g. How does the track score influence Spotify popularity?

model_score_popularity <- lm(`Spotify Popularity` ~ `Track Score`, data = data)
summary(model_score_popularity)
## 
## Call:
## lm(formula = `Spotify Popularity` ~ `Track Score`, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.455  -1.170   1.608   7.750  25.909 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   60.650527   0.314895  192.60   <2e-16 ***
## `Track Score`  0.068135   0.005535   12.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.47 on 4598 degrees of freedom
## Multiple R-squared:  0.0319, Adjusted R-squared:  0.03169 
## F-statistic: 151.5 on 1 and 4598 DF,  p-value: < 2.2e-16
# Scatter plot with regression line
ggplot(data, aes(x = `Track Score`, y = `Spotify Popularity`)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Track Score Influencing Spotify Popularity",
       x = "Track Score",
       y = "Spotify Popularity")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: The regression analysis shows a positive relationship between Track Score and Spotify Popularity. For each unit increase in Track Score, Spotify Popularity increases by approximately 0.068. The model is statistically significant, but the R-squared value of 0.0319 indicates that Track Score explains only about 3.2% of the variation in Spotify Popularity, so other factors account for most of it.

h. How do Spotify Playlist Counts vary by Artist?

top_artists <- data %>%
  group_by(Artist) %>%
  summarise(Total_Playlist_Count = sum(`Spotify Playlist Count`, na.rm = TRUE)) %>%
  arrange(desc(Total_Playlist_Count)) %>%
  slice(1:10)
ggplot(top_artists, aes(x = reorder(Artist, -Total_Playlist_Count), y = Total_Playlist_Count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 10 Artists by Spotify Playlist Count", x = "Artist", y = "Total Playlist Count") +
  coord_flip()

Interpretation: The plot shows the ten artists with the highest total Spotify playlist counts, summed over all of their tracks in the dataset. Because of coord_flip(), artists appear on the vertical axis and total playlist counts on the horizontal axis, and the horizontal bars make the totals easy to compare across artists.

i. How are Track Scores distributed across all songs?

# Plot histogram for Track Score
ggplot(data, aes(x = `Track Score`)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Track Scores",
       x = "Track Score",
       y = "Number of Tracks") +
  theme_minimal()

Interpretation: The histogram shows the distribution of track scores, with scores grouped into intervals of 10. The bars represent the number of tracks within each score range, giving an overview of how track scores are spread across the dataset.

j. Distribution of total digital presence

# Calculate the total sums for each important platform column
platform_totals <- c(
  "Spotify Streams" = sum(data$`Spotify Streams`, na.rm = TRUE),
  "YouTube Views" = sum(data$`YouTube Views`, na.rm = TRUE),
  "TikTok Views" = sum(data$`TikTok Views`, na.rm = TRUE),
  "AirPlay Spins" = sum(data$`AirPlay Spins`, na.rm = TRUE),
  "SiriusXM Spins" = sum(data$`SiriusXM Spins`, na.rm = TRUE),
  "Deezer Playlist Reach" = sum(data$`Deezer Playlist Reach`, na.rm = TRUE),
  "Amazon Playlist Count" = sum(data$`Amazon Playlist Count`, na.rm = TRUE),
  "Pandora Streams" = sum(data$`Pandora Streams`, na.rm = TRUE),
  "Soundcloud Streams" = sum(data$`Soundcloud Streams`, na.rm = TRUE),
  "Shazam Counts" = sum(data$`Shazam Counts`, na.rm = TRUE)
)

# Convert totals to a data frame for plotting
platform_df <- data.frame(
  Platform = names(platform_totals),
  Total = as.numeric(platform_totals)
)
# Create the pie chart
ggplot(platform_df, aes(x = "", y = Total, fill = Platform)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(
    title = "Distribution of Total Digital Presence",
    fill = "Platform"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10)
  ) +
  scale_fill_brewer(palette = "Paired")

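Interpretation: TikTok Views account for the largest share of the total digital presence. Note that the chart adds together metrics with different units (views, streams, spins, playlist counts and reach), so the shares indicate relative volume rather than directly comparable audience sizes.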
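Exact percentage shares are easier to read from a table than from pie slices. The sketch below shows one way to compute them with `prop.table()`; the vector `platform_totals_demo` and its values are made-up placeholders for illustration, not the totals computed from the dataset.

```r
# Placeholder totals (hypothetical values, not the report's results).
platform_totals_demo <- c("Platform A" = 600, "Platform B" = 300, "Platform C" = 100)

# Convert the totals to percentage shares, rounded to one decimal place.
platform_share <- round(100 * prop.table(platform_totals_demo), 1)

# Largest share first.
sort(platform_share, decreasing = TRUE)
```

Applied to the named totals vector built for the pie chart, the same two lines would give each platform's share as a percentage.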

The findings from the Spotify Songs 2024 dataset analysis would be highly valuable for several key groups:

1. Record Labels and Music Producers: To identify successful artists, understand listener behavior across platforms, and develop targeted marketing strategies to maximize song popularity and revenue.

2. Artists and Music Managers: To assess their performance relative to competitors, find opportunities to expand reach across platforms like TikTok and YouTube, and focus on areas that drive engagement and streaming numbers.

3. Marketing and Advertising Agencies: To design promotional campaigns based on insights into where and how audiences engage with music most actively.

4. Streaming Platforms (Spotify, YouTube, TikTok, etc.): To refine recommendation algorithms, develop partnerships with trending artists, and improve user experience by aligning content with listener preferences.

5. Music Analysts and Researchers: To explore industry trends, study platform correlations, and forecast future music consumption patterns.

6. Event Organizers and Concert Promoters: To choose trending artists for events and understand the kind of music that is currently resonating with large audiences.

7. Investors and Talent Scouts: To make data-driven decisions about which emerging artists or songs to invest in for future growth and returns.

Conclusion

The analysis of the Spotify Songs 2024 dataset provided comprehensive insights into streaming trends, artist popularity, and cross-platform engagement patterns. Data preprocessing ensured the reliability and consistency of the dataset by handling missing values, removing duplicates, and performing necessary data type conversions. Through extraction and filtering, key observations such as the identification of high-streaming songs, dominant artists, and the influence of major record labels were made. The study found a statistically significant but modest relationship between Spotify streams and YouTube views (R² ≈ 0.19) and almost no linear relationship between TikTok views and Spotify streams (r = 0.03), indicating that success on one platform only partly carries over to another.

Grouping and summarization techniques offered a deeper understanding of platform popularity and social media reach, helping to identify artists with substantial cross-platform influence. Sorting and ranking processes uncovered songs with enduring success in top rankings and emphasized consistency in performance across multiple platforms. Feature engineering enriched the dataset by creating new variables like Composite Popularity Scores and engagement ratios, allowing a more nuanced measurement of song and artist performance.

Statistical analyses, including ANOVA and regression modeling, showed a significant difference in YouTube Views between explicit and non-explicit tracks and quantified how well individual metrics predict one another: Spotify Playlist Count explained about 64% of the variation in Spotify Streams, YouTube Views about 19%, and Track Score only about 3% of the variation in Spotify Popularity. Visualization through ggplot2 enabled clear depiction of these patterns, including box plots of YouTube Views by explicit status, scatter plots with fitted regression lines, and the distribution of Track Scores.

Overall, the analysis not only strengthened the understanding of music streaming behavior in 2024 but also highlighted the importance of cross-platform dynamics and multi-faceted engagement metrics in evaluating musical success. These insights can assist record labels, artists, and marketers in developing more data-driven strategies for promoting music and optimizing audience reach across digital platforms.