Introduction

In this report, we will explore a list of the most streamed songs of 2023 according to Spotify. The dataset, sourced from Kaggle, includes detailed information about each song, such as its title, artist, release date, and availability on music platforms like Spotify, Apple Music, and Shazam. It also provides streaming statistics and various audio attributes for each song.

Import the data

We begin by importing the dataset, which has been downloaded directly from Kaggle to my working directory.

# Set working directory
setwd("/Users/elnazkhalili/Desktop/R Programming/mydata")

# Import the file
library(readr)
spotify <- read_csv("spotify-2023.csv")
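The import above depends on the current working directory. As a minimal sketch of an alternative, the path can be built with file.path() and passed straight to the reader. The temporary directory and the two-row sample file below are hypothetical stand-ins for the real data folder, and base R's read.csv() is used only so the sketch runs without extra packages.

```r
# Hypothetical location: a temporary directory stands in for the real data folder.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "spotify-sample.csv")

# Write a tiny stand-in file (invented values) so the sketch is self-contained.
writeLines(
  c("track_name,artist(s)_name,streams",
    "Track A,Artist A,100",
    "Track B,\"Artist A, Artist B\",200"),
  csv_path
)

# check.names = FALSE keeps names such as "artist(s)_name" exactly as written.
spotify_sample <- read.csv(csv_path, check.names = FALSE)
names(spotify_sample)
```

The same file.path() result could be handed to read_csv() in place of a bare file name, which removes the need for setwd().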

Preview the data

Next, we’ll preview the data to confirm that the import was successful.

library(knitr)

head(spotify) # Preview the first few rows.
## # A tibble: 6 × 24
##   track_name          `artist(s)_name` artist_count released_year released_month
##   <chr>               <chr>                   <dbl>         <dbl>          <dbl>
## 1 Flowers             Miley Cyrus                 1          2023              1
## 2 Ella Baila Sola     Eslabon Armado,…            2          2023              3
## 3 Shakira: Bzrp Musi… Shakira, Bizarr…            2          2023              1
## 4 TQG                 Karol G, Shakira            2          2023              2
## 5 La Bebe - Remix     Peso Pluma, Yng…            2          2023              3
## 6 Die For You - Remix Ariana Grande, …            2          2023              2
## # ℹ 19 more variables: released_day <dbl>, in_spotify_playlists <dbl>,
## #   in_spotify_charts <dbl>, streams <dbl>, in_apple_playlists <dbl>,
## #   in_apple_charts <dbl>, in_deezer_playlists <dbl>, in_deezer_charts <dbl>,
## #   in_shazam_charts <dbl>, bpm <dbl>, key <chr>, mode <chr>,
## #   `danceability_%` <dbl>, `valence_%` <dbl>, `energy_%` <dbl>,
## #   `acousticness_%` <dbl>, `instrumentalness_%` <dbl>, `liveness_%` <dbl>,
## #   `speechiness_%` <dbl>

Dataset Summary

Here is some immediate information we can extract: the dataset contains 952 observations and 24 variables. Below is a summary of these variables.

dim(spotify) # Displays the number of rows and columns.
## [1] 952  24
names(spotify) # Displays the names of our variables (columns).
##  [1] "track_name"           "artist(s)_name"       "artist_count"        
##  [4] "released_year"        "released_month"       "released_day"        
##  [7] "in_spotify_playlists" "in_spotify_charts"    "streams"             
## [10] "in_apple_playlists"   "in_apple_charts"      "in_deezer_playlists" 
## [13] "in_deezer_charts"     "in_shazam_charts"     "bpm"                 
## [16] "key"                  "mode"                 "danceability_%"      
## [19] "valence_%"            "energy_%"             "acousticness_%"      
## [22] "instrumentalness_%"   "liveness_%"           "speechiness_%"
str(spotify) # Provides an overview of the data frame structure and contents.
## spc_tbl_ [952 × 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_name          : chr [1:952] "Flowers" "Ella Baila Sola" "Shakira: Bzrp Music Sessions, Vol. 53" "TQG" ...
##  $ artist(s)_name      : chr [1:952] "Miley Cyrus" "Eslabon Armado, Peso Pluma" "Shakira, Bizarrap" "Karol G, Shakira" ...
##  $ artist_count        : num [1:952] 1 2 2 2 2 2 2 1 2 1 ...
##  $ released_year       : num [1:952] 2023 2023 2023 2023 2023 ...
##  $ released_month      : num [1:952] 1 3 1 2 3 2 4 2 1 1 ...
##  $ released_day        : num [1:952] 12 16 11 23 17 24 17 24 23 2 ...
##  $ in_spotify_playlists: num [1:952] 12211 3090 5724 4284 2953 ...
##  $ in_spotify_charts   : num [1:952] 115 50 44 49 44 47 40 77 26 27 ...
##  $ streams             : num [1:952] 1.32e+09 7.26e+08 7.22e+08 6.19e+08 5.54e+08 ...
##  $ in_apple_playlists  : num [1:952] 300 34 119 115 49 87 41 91 19 26 ...
##  $ in_apple_charts     : num [1:952] 215 222 108 123 110 86 205 212 143 124 ...
##  $ in_deezer_playlists : num [1:952] 745 43 254 184 66 74 54 78 10 15 ...
##  $ in_deezer_charts    : num [1:952] 58 13 29 18 13 1 12 6 6 1 ...
##  $ in_shazam_charts    : num [1:952] 1021 418 22 354 339 ...
##  $ bpm                 : num [1:952] 118 148 122 180 170 67 83 120 138 127 ...
##  $ key                 : chr [1:952] NA "F" "D" "E" ...
##  $ mode                : chr [1:952] "Major" "Minor" "Minor" "Minor" ...
##  $ danceability_%      : num [1:952] 71 67 78 72 81 53 57 78 78 80 ...
##  $ valence_%           : num [1:952] 65 83 50 61 56 50 56 76 89 74 ...
##  $ energy_%            : num [1:952] 68 76 63 63 48 53 72 59 83 77 ...
##  $ acousticness_%      : num [1:952] 6 48 27 67 21 23 23 43 10 36 ...
##  $ instrumentalness_%  : num [1:952] 0 0 0 0 0 0 0 0 0 0 ...
##  $ liveness_%          : num [1:952] 3 8 9 9 8 44 27 34 12 11 ...
##  $ speechiness_%       : num [1:952] 7 3 5 28 33 7 5 3 5 4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_name = col_character(),
##   ..   `artist(s)_name` = col_character(),
##   ..   artist_count = col_double(),
##   ..   released_year = col_double(),
##   ..   released_month = col_double(),
##   ..   released_day = col_double(),
##   ..   in_spotify_playlists = col_double(),
##   ..   in_spotify_charts = col_double(),
##   ..   streams = col_double(),
##   ..   in_apple_playlists = col_double(),
##   ..   in_apple_charts = col_double(),
##   ..   in_deezer_playlists = col_number(),
##   ..   in_deezer_charts = col_double(),
##   ..   in_shazam_charts = col_number(),
##   ..   bpm = col_double(),
##   ..   key = col_character(),
##   ..   mode = col_character(),
##   ..   `danceability_%` = col_double(),
##   ..   `valence_%` = col_double(),
##   ..   `energy_%` = col_double(),
##   ..   `acousticness_%` = col_double(),
##   ..   `instrumentalness_%` = col_double(),
##   ..   `liveness_%` = col_double(),
##   ..   `speechiness_%` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
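The str() output shows that the key column begins with NA, so at least one variable has missing values. A quick per-column count can be sketched as follows; the small data frame is a made-up stand-in for spotify.

```r
# Toy stand-in for the real data frame (values invented for illustration).
demo <- data.frame(
  track_name = c("A", "B", "C", "D"),
  key = c(NA, "F", "D", NA),
  streams = c(100, NA, 300, 400)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(demo))
na_counts

# Show only the columns that actually contain missing values.
na_counts[na_counts > 0]
```

Applied to the real data, the equivalent call would be colSums(is.na(spotify)).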

Before we go any further, we’ll use the attach() function so we can reference columns in the data frame directly by the column names.

attach(spotify)
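One thing to keep in mind is that attach() copies the columns onto the search path at the moment it is called, so later changes to the data frame are not reflected, and objects with the same names in the workspace take precedence. As a sketch of two alternatives that avoid this, columns can be reached with $ or evaluated inside with(); the data frame here is a toy stand-in.

```r
# Toy stand-in for the real data frame (values invented for illustration).
demo <- data.frame(streams = c(100, 200, NA), bpm = c(120, 90, 140))

# Option 1: refer to the column explicitly with $.
mean_dollar <- mean(demo$streams, na.rm = TRUE)

# Option 2: evaluate an expression with the data frame's columns in scope.
mean_with <- with(demo, mean(streams, na.rm = TRUE))

identical(mean_dollar, mean_with)
```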

Average Streams

Compute the average number of streams for the most popular songs of 2023.

mean_streams <- mean(streams, na.rm = TRUE) # Compute the mean of the 'streams' column and exclude any missing values.

formatted_mean <- format(mean_streams, big.mark = ",", scientific = FALSE) # Add commas as thousands separators and suppress scientific notation.

paste("The average number of streams for the", nrow(spotify), "most popular songs of 2023 on Spotify is", formatted_mean)
## [1] "The average number of streams for the 952 most popular songs of 2023 on Spotify is 514,137,425"
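To see why both na.rm = TRUE and scientific = FALSE matter here, the following sketch runs the same two steps on a short invented vector that contains one missing value.

```r
# Invented values, chosen so the mean is a round number.
streams_demo <- c(1500000000, 250000000, NA, 50000000)

# Without na.rm = TRUE, a single NA makes the whole mean NA.
mean(streams_demo)

# With na.rm = TRUE, the NA is dropped and the mean is taken over three values.
mean_demo <- mean(streams_demo, na.rm = TRUE)

# By default, format() may switch to scientific notation for large numbers.
default_format <- format(mean_demo)

# big.mark adds thousands separators; scientific = FALSE forces fixed notation.
formatted_demo <- format(mean_demo, big.mark = ",", scientific = FALSE)
formatted_demo
```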

Top Tracks

Compute the top 5 most-streamed tracks and their respective artists.

library(dplyr) # Provides %>%, arrange(), select(), mutate(), and the other verbs used below

top_tracks <- spotify %>%
  arrange(desc(streams)) %>% # Sort by streams in descending order
  head(5) %>%  # Select the top 5 tracks
  select("track_name", "artist(s)_name", "streams") %>%  # Select relevant columns
  mutate(streams = format(streams, big.mark = ",", scientific = FALSE)) # Format the streams column
top_tracks
## # A tibble: 5 × 3
##   track_name                                    `artist(s)_name`      streams   
##   <chr>                                         <chr>                 <chr>     
## 1 Blinding Lights                               The Weeknd            3,703,895…
## 2 Shape of You                                  Ed Sheeran            3,562,543…
## 3 Someone You Loved                             Lewis Capaldi         2,887,241…
## 4 Dance Monkey                                  Tones and I           2,864,791…
## 5 Sunflower - Spider-Man: Into the Spider-Verse Post Malone, Swae Lee 2,808,096…
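The pipeline above sorts, truncates, selects, and only then formats. For readers who want to see the same logic without the pipe, here is a base R sketch on a toy data frame (track names and counts are invented for illustration). It keeps the numeric column and adds the formatted text as a separate, hypothetical streams_label column.

```r
# Toy data: track names and stream counts are invented for illustration.
demo <- data.frame(
  track_name = c("A", "B", "C", "D", "E", "F"),
  streams = c(3000000, 9000000, 1000000, 7000000, 5000000, 8000000)
)

# Sort by streams, largest first, then keep the first three rows.
ranked <- demo[order(demo$streams, decreasing = TRUE), ]
top3 <- head(ranked, 3)

# Add a display label but keep the numeric column for further calculations.
top3$streams_label <- format(top3$streams, big.mark = ",", scientific = FALSE)
top3
```

Formatting is left to the last step on purpose: format() returns character strings, and sorting on those would order the values as text rather than as numbers.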

Top Artists

Compute which artists had the highest total number of streams.

# First, process and format total streams by artist
top_artists <- spotify %>%
  group_by(`artist(s)_name`) %>%  # Group by artist name
  summarise(total_streams = sum(streams, na.rm = TRUE)) %>%  # Sum up total streams for each artist
  arrange(desc(total_streams)) %>%  # Sort total streams in descending order
  mutate(total_streams = format(total_streams, big.mark = ",", scientific = FALSE))  # Format the total streams column
top_artists
## # A tibble: 644 × 2
##    `artist(s)_name` total_streams   
##    <chr>            <chr>           
##  1 The Weeknd       "14,185,552,870"
##  2 Taylor Swift     "14,053,658,300"
##  3 Ed Sheeran       "13,908,947,204"
##  4 Harry Styles     "11,608,645,649"
##  5 Bad Bunny        " 9,997,799,607"
##  6 Olivia Rodrigo   " 7,442,148,916"
##  7 Eminem           " 6,183,805,596"
##  8 Bruno Mars       " 5,846,920,599"
##  9 Arctic Monkeys   " 5,569,806,731"
## 10 Imagine Dragons  " 5,272,484,650"
## # ℹ 634 more rows
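The leading spaces in rows 5 to 10 come from format(), which pads every element of a vector to a common width. The sketch below shows the behaviour, and one way to avoid it with trim = TRUE, using two totals copied from the table above.

```r
# Two totals taken from the table above (rows 1 and 5).
totals <- c(14185552870, 9997799607)

# Default behaviour: shorter values are left-padded to match the widest one.
padded <- format(totals, big.mark = ",", scientific = FALSE)
padded

# trim = TRUE suppresses the padding.
trimmed <- format(totals, big.mark = ",", scientific = FALSE, trim = TRUE)
trimmed
```

The padding is harmless in a printed table, but it would show up as an extra space if a padded value were pasted into a sentence.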
# Then, extract the top artist and their total streams
top_artist <- top_artists %>% 
  slice(1) %>% # Pull the top artist (first row)
  pull(`artist(s)_name`) # Extract the artist name

top_streams <- top_artists %>% 
  slice(1) %>% # Pull the top artist (first row)
  pull(total_streams) # Extract the total streams

paste("The top artist is", top_artist, "with a total of", top_streams, "streams.")
## [1] "The top artist is The Weeknd with a total of 14,185,552,870 streams."
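Note that grouping by `artist(s)_name` treats each distinct string as one artist, so a collaboration such as "Karol G, Shakira" is counted separately from either artist's solo tracks. If every listed artist should receive credit, one possible approach is to split the field before summing. The sketch below does this in base R on invented numbers and assumes that names are always separated by ", "; a name that itself contains a comma would be split incorrectly.

```r
# Toy data: artist names and stream counts are invented for illustration.
demo <- data.frame(
  artists = c("Artist A", "Artist A, Artist B", "Artist B", "Artist C"),
  streams = c(100, 50, 30, 70),
  stringsAsFactors = FALSE
)

# Split each artist string into its individual names (assumes ", " separators).
split_names <- strsplit(demo$artists, ", ", fixed = TRUE)

# Build one row per (artist, track), repeating the track's streams for each name.
long <- data.frame(
  artist = unlist(split_names),
  streams = rep(demo$streams, lengths(split_names))
)

# Sum streams per individual artist and sort from largest to smallest.
totals <- sort(tapply(long$streams, long$artist, sum), decreasing = TRUE)
totals
```

With this approach a collaboration's streams are credited in full to each participant, so the per-artist totals add up to more than the overall stream count.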

Number of Streams and Danceability

Analyze the relationship between the number of streams and danceability, a percentage score that indicates how suitable a song is for dancing.

library(ggplot2)
library(scales)

# First, we'll create a line graph to visualize the relationship.
ggplot(spotify, aes(x = `danceability_%`, y = streams)) +
  geom_line(color = "#57068c", linewidth = 1) +  # Line connecting points
  scale_y_continuous(labels = comma) +  # Format y-axis labels with commas
  labs(
    title = "Comparison of Streams and Danceability",
    x = "Danceability (%)",
    y = "Number of Streams"
  ) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )

# Next, we'll compute the correlation coefficient.
cor(`danceability_%`, `streams`, use = "complete.obs")
## [1] -0.1054569

From the line graph, we can infer that there is no strong or consistent relationship between a track's danceability and its number of streams. The line fluctuates without a clear upward or downward trend, which suggests that stream counts do not vary systematically with danceability.
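Because many tracks share the same danceability value, a line drawn through every observation zigzags vertically and can be hard to read. A scatter plot is a common alternative for two numeric variables. The sketch below uses simulated data (not the Spotify file) with the same ggplot2 and scales packages; the column names and value ranges are assumptions made for illustration.

```r
library(ggplot2)
library(scales)

# Simulated stand-in data: 200 tracks with random danceability and streams.
set.seed(1)
demo <- data.frame(
  danceability = sample(30:95, 200, replace = TRUE),
  streams = round(runif(200, min = 1e7, max = 2e9))
)

# One point per track; transparency helps where points overlap.
p <- ggplot(demo, aes(x = danceability, y = streams)) +
  geom_point(color = "#57068c", alpha = 0.5) +
  scale_y_continuous(labels = comma) +  # Format y-axis labels with commas
  labs(
    title = "Streams versus Danceability (simulated data)",
    x = "Danceability (%)",
    y = "Number of Streams"
  ) +
  theme_minimal()
p
```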

The correlation coefficient of -0.105 confirms a very weak negative linear relationship between danceability and the number of streams. Squaring it gives roughly 0.011, so danceability accounts for only about 1% of the variation in stream counts, and other factors are likely more influential in determining a track’s popularity.
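For reference, cor() computes Pearson's coefficient by default, and use = "complete.obs" drops any pair that has a missing value before the calculation. The sketch below, on made-up numbers, reproduces the built-in result by hand from the definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations).

```r
# Made-up paired observations, each vector with one missing value.
x <- c(70, 65, 80, NA, 55, 90)
y <- c(500, 620, 450, 300, NA, 410)

# Built-in Pearson correlation on the complete pairs only.
r_builtin <- cor(x, y, use = "complete.obs")

# Manual version: keep only the pairs where both values are present.
keep <- complete.cases(x, y)
xk <- x[keep]
yk <- y[keep]

# Pearson's r from its definition.
r_manual <- sum((xk - mean(xk)) * (yk - mean(yk))) /
  sqrt(sum((xk - mean(xk))^2) * sum((yk - mean(yk))^2))

all.equal(r_builtin, r_manual)
```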

Although there is still much more analysis to be conducted on this dataset, this concludes our initial round of exploration.