Introduction

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?

Along with the rise of social media, people increasingly experience FOMO, the fear of missing out (69% of U.S. people report having experienced it), which leads them to listen to music based on its popularity. Although the creative process is inherently subjective, is there a “formula” for a hit song? In this project, we aim to provide an analytical report that not only helps readers understand market trends but also highlights specific opportunities within the existing data set, so that creators can align their sound with those trends while maintaining their unique voice and make more informed, data-driven decisions about what to promote.

While our data set spans from 1960 to 2020, the rise of social media, particularly since 2010, has significantly reshaped listening behavior through phenomena such as FOMO. By comparing song trends before and after the social media boom, we aim to examine how popularity-driven consumption has influenced musical characteristics, offering insights into how creators can adapt in a socially amplified market.

1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed).

We will use a Spotify data set obtained from GitHub. We will identify the variables most relevant to our problem statement, manipulate the data accordingly, and build visualizations (ggplot histograms and other plots) to analyze the information we are seeking. These graphs will help us determine the most popular artists and what audiences look for in music, using danceability, liveness, and energy as our dimensions.

1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.

Regression analysis allows us to explore the relationship between each variable and a song’s popularity, helping identify which features have the strongest impact. We plan to examine the following relationships (a minimal regression sketch follows the list):

  • Audio features ↔︎ popularity
  • Genre ↔︎ popularity
  • Artists ↔︎ popularity
  • Release timing (seasonal effect) ↔︎ popularity
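As an illustration of this technique, a minimal sketch of the audio-features model is shown below; the variable names (spotify_clean_2, speech_ratio, positivity) refer to the cleaned data set constructed in the Data Preparation section.

# A minimal sketch of the planned regression: popularity modeled as a linear
# function of the audio features.
popularity_model <- lm(track_popularity ~ danceability + energy + positivity +
                         tempo + loudness + speech_ratio + acousticness,
                       data = spotify_clean_2)
summary(popularity_model)  # coefficient estimates, tests, and overall fit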

1.4 Explain how your analysis will help the consumer of your analysis.

Artists and producers can make strategic choices to increase the reach of their music, talent scouts can identify artists with high commercial potential, and even music lovers can discover hidden gems that have long been overlooked in their playlists.

Packages Required

2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.

2.2 Messages and warnings resulting from loading the package are suppressed.

2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).

The following packages are loaded up front (a loading sketch appears after this list); more packages can be added as the analysis evolves:

  • library(tidyverse) - A comprehensive toolkit for data science workflows, including data import, cleaning, transformation, visualization, and integration.
  • library(dplyr) - Data manipulation.
  • library(tidyr) - Reshaping and organizing data.
  • library(ggplot2) - Creating flexible, publication-quality plots.
  • library(lubridate) - Working with dates and times.
  • library(knitr) - Dynamic report generation.
  • library(kableExtra) - Enhanced table styling.
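A sketch of the loading chunk is shown below; in the R Markdown source, the chunk options message = FALSE and warning = FALSE (or the wrapper used here) keep the loading output clean.

# Load every package required to replicate the analysis, suppressing startup messages.
suppressPackageStartupMessages({
  library(tidyverse)   # data import, wrangling, and visualization
  library(dplyr)       # data manipulation verbs
  library(tidyr)       # reshaping and organizing data
  library(ggplot2)     # plotting
  library(lubridate)   # dates and times
  library(knitr)       # dynamic report generation
  library(kableExtra)  # enhanced table styling
})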

Data Preparation

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.

  • We will use the Spotify data set from the course material, named “spotify_songs.csv”, which comes from the TidyTuesday project (rfordatascience/tidytuesday on GitHub, 2020-01-21 release).

3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).

  • Origin: Part of the TidyTuesday weekly data project for practicing R skills.
  • Purpose: Designed to help users learn data wrangling and visualization using ggplot2, dplyr, tidyr, and other tidyverse tools.
  • Community: Created by members of the R4DS Online Learning Community, inspired by the “R for Data Science” textbook.
  • Source: Data collected from Spotify via the spotifyr package.
  • Date Created: January 21, 2020.
  • Authors: Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff.
  • Size: 32,833 records and 23 variables.
  • Content: Includes track metadata (e.g., artist, album, genre) and musical features (e.g., danceability, energy, valence).
  • Missing Values: Recorded as NA; no imputation was applied.
  • Use Case: Ideal for exploratory analysis, genre comparison, and building visualizations.

3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.

3.3.1 Import the data set & view its structure in RStudio:

Import the data set:

spotify <- read.csv("C:/Users/samc8/OneDrive - Xavier University/Data Wrangling/Week 4/spotify_songs (2).csv")

View structure of the data set:

str(spotify) 

View summary statistics of the data set:

summary(spotify)

3.3.2 Removal of Unused Variables

  • playlist_subgenre: Contains 24 distinct sub-genres that introduce noise and fragmentation. Removed to avoid overfitting or misleading groupings in genre-based analysis.
  • playlist_id: Unique identifier with no analytical value. Dropped to reduce dimensionality and avoid clutter.
  • track_album_id / track_id: Technical identifiers used for database referencing, not meaningful for visualization or modeling.

These variables are removed to streamline the data set:

spotify$playlist_id <- NULL
spotify$track_album_id <- NULL
spotify$track_id <- NULL
spotify$playlist_subgenre <- NULL

3.3.3 Renaming Variables

colnames(spotify) <- c("track_name", "track_artist", "track_popularity", "track_album_name",
                       "track_album_release_date", "playlist_name", "playlist_genre", "danceability",
                       "energy", "key", "loudness", "mode", "speech_ratio",
                       "acousticness", "instrumentalness", "liveness", "positivity",
                       "tempo", "duration_ms")
  • Renamed “speechiness” to speech_ratio to clarify that the variable reflects the proportion of spoken content in a track.
  • Renamed “valence” to positivity to make the emotional tone more intuitive and easier to interpret.
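Since only two columns actually change name, an equivalent, position-independent alternative is dplyr::rename() (new_name = old_name), assuming the remaining column names are left as-is:

# Equivalent to the colnames() assignment above; rename only the two columns
# whose names change.
spotify <- spotify %>%
  rename(speech_ratio = speechiness,
         positivity   = valence)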

3.3.4 Issues with missing values

Find missing values:

colSums(is.na(spotify))

  • There are 15 missing values in total:
    • 5 missing values from track_name
    • 5 missing values from track_artist
    • 5 missing values from track_album_name

Remove the rows containing missing values:

spotify_clean <- na.omit(spotify)

3.3.5 - Conversion of date format.

We convert the album release date into a proper Date format:

spotify_clean <- spotify_clean %>%
  mutate(track_album_release_date = as.Date(track_album_release_date))
  
str(spotify_clean$track_album_release_date)

Alternative (other code to consider): if some release dates contain only a year or a year-month (a quirk that can occur in this data set), as.Date() returns NA for those entries. A hedged alternative is lubridate::ymd() with truncated = 2, which pads the missing month/day instead of dropping the date:
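spotify_clean <- spotify_clean %>%
  mutate(track_album_release_date = lubridate::ymd(track_album_release_date,
                                                   truncated = 2))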

3.3.6 - Creation of new dimensions.

We create new dimensions for the following variables:
  • release_year - Created the “release_year” column to enable year-based analysis of track trends, allowing for easier aggregation and comparison over time.
  • duration_min - Since song duration was originally stored in milliseconds, we created a new variable “duration_min” to express it in minutes, making comparisons and visualizations more intuitive.
spotify_clean <- spotify_clean %>%
  mutate(
    release_year = lubridate::year(track_album_release_date),
    duration_min = duration_ms / 60000
  )

3.3.7.1 - Finding Outliers.

boxplot(spotify_clean$duration_min,
        main = "Boxplot of Song Duration (min)",
        ylab = "Duration (minutes)")

3.3.7.2 - Removal of Outliers.

To avoid excluding valid songs with unusually long or short durations, we apply an asymmetric threshold: 4 × IQR above the third quartile and 2 × IQR below the first quartile. This approach broadens the acceptable range while still filtering extreme values, helping preserve meaningful variation in the data set without misclassifying legitimate entries as outliers.

Q1 <- quantile(spotify_clean$duration_min, 0.25, na.rm = TRUE)
Q3 <- quantile(spotify_clean$duration_min, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
upper_bound <- Q3 + 4 * IQR
lower_bound <- Q1 - 2 * IQR
spotify_clean_2 <- spotify_clean[
  spotify_clean$duration_min >= lower_bound & spotify_clean$duration_min <= upper_bound, ]
boxplot(spotify_clean_2$duration_min,
        main = "Boxplot of Song Duration (min, no outliers)",
        ylab = "Duration (minutes)")
length(spotify_clean_2$duration_min)

After data cleaning, 32 observations were removed in total (the rows containing missing values together with the duration outliers flagged above).

  • Original data set: 32,833 observations.
  • Cleaned data set: 32,801 observations.
(Figure: Boxplot of Song Duration)
Description:
  • The interquartile range (also known as the midspread) helps identify outliers by giving a clear picture of the data’s spread, making it easier to assess variability and understand the distribution of the whole data set.

3.3.8 - Grouping by artist and calculating popularity.

Grouped tracks by artist and calculated each artist’s total popularity (the sum of track popularity across their songs), along with average popularity and song count, then ranked artists by total popularity and extracted the top 10 for focused analysis.

artist_popularity <- spotify_clean %>%
  group_by(track_artist) %>%                    
  summarise(total_popularity = sum(track_popularity, na.rm = TRUE),
            avg_popularity = mean(track_popularity, na.rm = TRUE),
            song_count = n()) %>%                
  arrange(desc(total_popularity))

Select the top 10 artists from the ranked data set:

top10 <- artist_popularity %>% slice_head(n = 10)

Visualize the top 10 artists using ggplot:

ggplot(top10, aes(x = reorder(track_artist, total_popularity),
                  y = total_popularity)) +
  geom_col(fill = "#1DB954") +
  coord_flip() +
  labs(title = "Top 10 Artists by Total Popularity",
       x = NULL, y = "Total Popularity") +
  theme_minimal()
(Figure: Top 10 Artists by Total Popularity)

3.3.9 - Conversion into dummy variables

At this stage, categorical variables such as genre have not been converted into dummy variables, as the current analysis does not require it. This transformation may be considered in future modeling steps if needed; a possible approach is sketched below.
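A minimal sketch using base R’s model.matrix() (assuming playlist_genre is the categorical variable to encode):

# A sketch only, not yet applied: expand playlist_genre into 0/1 indicator
# columns and bind them back onto the cleaned data set.
genre_dummies <- model.matrix(~ playlist_genre - 1, data = spotify_clean_2)
spotify_model <- cbind(spotify_clean_2, genre_dummies)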

3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

kableExtra::kbl(head(spotify_clean_2, 10)) %>%
  kableExtra::kable_paper() %>%
  kableExtra::scroll_box(width = "700px", height = "300px")
(Scrollable table output: the first 10 rows of spotify_clean_2, showing all 21 variables, from track_name, track_artist, and track_popularity through the derived release_year and duration_min columns.)

3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

Summary statistics for the numeric variables in columns 10–19 (key through duration_ms) of the cleaned data set:

summary(spotify_clean_2[,10:19])

Number of tracks per release year:

table(spotify_clean_2$release_year)

Proposed - Exploratory Data Analysis (EDA)

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

Based on the popularity of the top artists, we want to answer the following questions:

  1. Does the artist’s popularity affect the music’s popularity?

  2. Do audio features (danceability, energy, valence, tempo, loudness, etc.) significantly relate to popularity?

  3. Which audio features (danceability, energy, valence, tempo, loudness, etc.) have the strongest relationship with popularity?

  4. Do instrumental or acoustic songs perform worse than vocal or electronic songs?

  5. Does speechiness (rap-like lyrics) correlate positively or negatively with popularity?

  6. Does release timing (year) influence popularity? (Comparing two periods: 2010–2020 versus everything before 2010; a sketch of this split follows.)
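For question 6, one possible slice is sketched below (assuming spotify_clean_2 and release_year from the Data Preparation section):

# Tag each track as pre- or post-social-media era and compare mean popularity.
era_summary <- spotify_clean_2 %>%
  mutate(era = if_else(release_year >= 2010, "2010-2020", "Pre-2010")) %>%
  group_by(era) %>%
  summarise(mean_popularity = mean(track_popularity, na.rm = TRUE),
            track_count = n())
era_summary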

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

We plan to utilize various tables and plots to visualize our data.

  • We would like to use a “Bar Chart” to rank the top 10 artists by total popularity. We can also visualize the top 10 least popular artists in the same fashion.

  • We will also utilize other charts, such as a “Correlation Plot” to identify related attributes within the data set and “Boxplots” to visualize outliers and compare two factors against each other. Finally, we plan on utilizing “Heat Maps,” where the color of each variable stands out relative to other attributes, making them easy to distinguish (a minimal heat-map sketch follows).
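As a starting point, here is a minimal correlation heat-map sketch using only the packages already loaded (assuming spotify_clean_2 from the cleaning steps above):

# Correlation matrix of popularity and the audio features, drawn as a heat map.
audio_features <- spotify_clean_2 %>%
  select(track_popularity, danceability, energy, loudness, speech_ratio,
         acousticness, instrumentalness, liveness, positivity, tempo)

cor_long <- as.data.frame(as.table(cor(audio_features, use = "complete.obs")))

ggplot(cor_long, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "#1DB954",
                       limits = c(-1, 1)) +
  labs(title = "Correlation Between Audio Features and Popularity",
       x = NULL, y = NULL, fill = "r") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))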

4.3 What do you not know how to do right now that you need to learn to answer your questions?

  • We want to learn how to use correlation plots and heat maps to better visualize relationships between musical features and song popularity.

  • Handle year and time-related variables effectively, whether to group songs by decade, segment by pre/post social media era, or account for delayed popularity trends.

  • Apply clustering methods (e.g., K-means or hierarchical clustering) to group songs based on musical characteristics.

  • Implement Principal Component Analysis (PCA) to reduce dimensionality and visualize the underlying structure of our data set (a starting sketch for the clustering and PCA items appears after this list).

  • We are unsure how to measure artist popularity without bias from song count. We need to learn how to design a fair composite or normalized metric.
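For reference, a minimal starting point for the clustering and PCA items above (a sketch under the assumption that spotify_clean_2 is available, not a final analysis):

# Scale the audio features, run PCA, then K-means on the first two components.
feature_matrix <- spotify_clean_2 %>%
  select(danceability, energy, loudness, speech_ratio, acousticness,
         instrumentalness, liveness, positivity, tempo) %>%
  scale()

pca_fit <- prcomp(feature_matrix)
summary(pca_fit)            # proportion of variance explained per component

set.seed(123)               # reproducible cluster assignments
km_fit <- kmeans(pca_fit$x[, 1:2], centers = 4, nstart = 25)  # centers = 4 is illustrative
table(km_fit$cluster)       # cluster sizes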

4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

  • We plan to use linear regression to identify the key factors that influence a song’s popularity, both across different time periods and specifically within the top 10 most popular artists. This focus allows us to uncover general trends in musical success while also analyzing the unique characteristics and strategies of leading artists.

  • Linear regression provides interpretable coefficients along with statistical tests and confidence intervals. Additionally, we can incorporate other ML algorithms, such as K-means clustering, support vector machines, and decision trees, to better predict consumers’ taste in music.